From boconnor at ucla.edu  Wed Mar  1 16:34:38 2006
From: boconnor at ucla.edu (Brian O'Connor)
Date: Wed, 01 Mar 2006 13:34:38 -0800
Subject: [DAS2] Re: Re DAS2 Server
In-Reply-To: <C02B5657.A1E%vedupuganti@tgen.org>
References: <C02B5657.A1E%vedupuganti@tgen.org>
Message-ID: <4406136E.6060703@ucla.edu>

Hi Vidya,

So I think your best option is to try the RPM.  I built a Fedora Core 2 
RPM for DAS2 and just released it to http://biopackages.net last night.  
I could really use someone to test it so feedback would be great.  The 
RPM approach is nice because yum will take care of installing all the 
dependencies including the chado database.

If you're not using FC2 then it's a little but more involved.  We don't 
really have a lot of docs but I could update the README in cvs (see 
http://sourceforge.net/projects/gmod it's the "das2" module).   Until 
recently there wasn't  really an install process you just do a "perl 
Makefile.PL; make; make test" to run DAS2.  There's now an "install" 
target so you can do "perl Makefile.PL; make; sudo make install".  You 
need to set some environmental variables, install a chado DB, and make 
sure all the perl module dependencies are installed before you do this 
though.  See the Makefile.PL for the environmental variables you need to 
set.  I'll update the README to include information about the dependencies.

Hope this helps!  I cc'd Allen Day too, he might have some helpful hints...

--Brian

Vidya Edupuganti wrote:

>Hi Brian,
>I am trying to setup DAS/2 server so that it can be used with Affymetrix's
>IGB browser. I was trying to find a user manual for setting up DAS/2 server.
>I could not find any. Can you please direct me to a place where I can find
>it. If there isn't  any can you please give me some inputs on how to install
>a DAS/2 server and load data.
>I really appreciate your help,
>Thanks
>Vidya
>
>
>
>
>Vidyadari Edupuganti
>Bioinformatician, Bioinformatics Research Unit
>The Translational Genomics Research Unit (TGen)
>445 N. Fifth St
>Phoenix, AZ, 85004, USA
>
>
>
>  
>


From dalke at dalkescientific.com  Fri Mar  3 04:55:02 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 3 Mar 2006 02:55:02 -0700
Subject: [DAS2] working das validator
Message-ID: <44479892cb0e465913b82e02a5c2525c@dalkescientific.com>

I have a running validator at

http://cgi.biodas.org:8080/


I've only tested it with SOURCES document but there's little
that would fail with the others.

I had planned to get this up a couple days ago but I've been
distracted learning more about Javascript and a couple of Javascript
libraries.  I used Mochikit to make the interactivity you see
there, and I have some ideas about how to use Dojo -- but not
for a couple of weeks.

The code goes through the following validation steps:

  - TODO - handle if the URL is not fetchable and handle timeouts
  - check that the content-type agrees with the document type
  - check that it's well-formed XML; report error where not
  - check that the root element matches the document type
  - check that it passed the Relax-NG validation;
  - report the id and href fields which are empty strings
  - report if any date fields are not iso dates

There are many more checks I could add.  They are easy now
that the scaffold is there.

I'm going to work on the next draft now.

After that I'll get back to the validator.  I want to add
hyperlinks on fields which are links, and I have an idea of
how to add a "SEARCH" button next to the query urls which
creates a popup where you can fill in the different fields
before doing the search.

Budget-wise I'm not sure how to charge the last few days
of work as it was a "wouldn't it be neat if" project rather
than something really needed.  It is neat though ...

					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Fri Mar  3 12:34:11 2006
From: Steve_Chervitz at affymetrix.com (Chervitz, Steve)
Date: Fri, 3 Mar 2006 09:34:11 -0800
Subject: [DAS2] working das validator
In-Reply-To: <44479892cb0e465913b82e02a5c2525c@dalkescientific.com>
Message-ID: <C02DBE13.1CB21%Steve_Chervitz@affymetrix.com>

Andrew,

Nice work on the web interface to the validator. Before you dive back into
the spec, could you troubleshoot these 500 errors I'm getting on your
server?

URL: http://das.biopackages.net/das/genome

With the "guess" radio button I get:

   500 Internal error
   ....
   TypeError: GuessFromHeader() takes exactly 2 arguments (1 given)

With any other radio button I get:

   500 Internal error
   ....
   AttributeError: BodyError instance has no attribute 'args'

Steve

> From: Andrew Dalke <dalke at dalkescientific.com>
> Date: Fri, 3 Mar 2006 02:55:02 -0700
> To: DAS/2 <das2 at portal.open-bio.org>
> Subject: [DAS2] working das validator
> 
> I have a running validator at
> 
> http://cgi.biodas.org:8080/
> 
> 
> I've only tested it with SOURCES document but there's little
> that would fail with the others.
> 
> I had planned to get this up a couple days ago but I've been
> distracted learning more about Javascript and a couple of Javascript
> libraries.  I used Mochikit to make the interactivity you see
> there, and I have some ideas about how to use Dojo -- but not
> for a couple of weeks.
> 
> The code goes through the following validation steps:
> 
>   - TODO - handle if the URL is not fetchable and handle timeouts
>   - check that the content-type agrees with the document type
>   - check that it's well-formed XML; report error where not
>   - check that the root element matches the document type
>   - check that it passed the Relax-NG validation;
>   - report the id and href fields which are empty strings
>   - report if any date fields are not iso dates
> 
> There are many more checks I could add.  They are easy now
> that the scaffold is there.
> 
> I'm going to work on the next draft now.
> 
> After that I'll get back to the validator.  I want to add
> hyperlinks on fields which are links, and I have an idea of
> how to add a "SEARCH" button next to the query urls which
> creates a popup where you can fill in the different fields
> before doing the search.
> 
> Budget-wise I'm not sure how to charge the last few days
> of work as it was a "wouldn't it be neat if" project rather
> than something really needed.  It is neat though ...
> 
> Andrew
> dalke at dalkescientific.com
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2


From dalke at dalkescientific.com  Fri Mar  3 13:04:12 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 3 Mar 2006 11:04:12 -0700
Subject: [DAS2] working das validator
In-Reply-To: <C02DBE13.1CB21%Steve_Chervitz@affymetrix.com>
References: <C02DBE13.1CB21%Steve_Chervitz@affymetrix.com>
Message-ID: <5d7729f77f8d4b6dcbd8dacd04701c19@dalkescientific.com>

Hi Steve,

   I saw those errors in the log file but wasn't sure if they were
from you or Gregg.

> URL: http://das.biopackages.net/das/genome
>
> With the "guess" radio button I get:
>
>    500 Internal error
>    ....
>    TypeError: GuessFromHeader() takes exactly 2 arguments (1 given)

Fixed.

> With any other radio button I get:
>
>    500 Internal error
>    ....
>    AttributeError: BodyError instance has no attribute 'args'

Fixed.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Sat Mar  4 20:59:15 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sat, 4 Mar 2006 18:59:15 -0700
Subject: [DAS2] current text of draft 3 of spec
Message-ID: <5e3c38635022ba8ae291cd6c4e036eef@dalkescientific.com>

I've been working on the 3rd draft for the spec.  Because of
the confusion in the previous version I've decided on a
different approach where I jump into the middle and describe
how the parts fit together before getting into the details of
every element type or the theory behind the architecture.

I think this flows much better.

====================

DAS is a protocol for sharing biological data.  This version of the
specification, DAS 2.0, describes features located on the genomic
sequence.  Future versions will add support for sharing annotations of
protein sequences, expression data, 3D structures and ontologies.  The
genomic DAS interface is deliberately designed so there will be a
large core shared with the protein sequence DAS.

A DAS 2.0 annotation server provides feature information about one or
more genome sources.  Each source may have one or more versions.
Different versions are usually based on different assemblies.  As an
implementation detail an assembly and corresponding sequence data may
be distributed via a different machine, which is called the reference
server.

Annotations are located on the genomic sequence with a start and end
position.  The range may be specified multiple times if there are
alternate coordinate systems.  An annotation may contain multiple
non-continguous parts, making it the parent of those parts.  Some
parts may have more than one parent.  Annotations have a type based on
terms in SOFA (Sequence Ontology for Feature Annotation).  Stylesheets
contain a set of properties used to depict a given type.

Annotations can be searched by range, type, and a properties table
associated with each annotation.  These are called feature filters.

DAS 2.0 is implemented using a ReST architecture.  Each document (also
called an entity or object) has a name, which is a URL.  Fetching the
URL gets information about the document.  The DAS-specific documents
are all in XML.  Other data types have existing widely used formats,
and sometimes more than one for the same data.  A DAS server may
provide a distinct document for each of these formats, along with
information about which formats are available.

DAS 2.0 addresses some shortcomings of the DAS 1.x protocol, including:

  * Better support for hierachical structures (e.g. transcript + exons)

  * Ontology-based feature annotations

  * Allow multiple formats, including formats only appropriate for
    some feature types

  * A lock-based editing protocol for curational clients

  * An extensible namespacing system that allows annotations in
   non-genomic coordinates (e.g. uniprot protein coordinates or PDB
   structure coordinates)

=====

A DAS server supplies information about genomic sequence data sources.
The collection of all sources, each data source, and each version of a
data source are accessible through a URL.  All three classes of URLs
return a document of content-type 'application/x-das-sources+xml'
though likely with differing amounts of detail.  A 'versioned source'
request returns information only about a specific version of a data
source.  A 'source' request returns the list of all the versioned
source data for that source.  A 'sources' request returns the list of
all the source data, including all the versioned source data.

The URLs might not be distinct.  For example, a server with only one
version of one data source may use the same URL for all three
documents, and a server for a single organism may use the same URL for
the 'sources' and 'source' documents.

Most servers will list only the data sources provided by that server.
Some servers combine the sources documents from other servers into a
single document.  These registry servers act as a centralized index
and reduce configuration and network overhead.  A registry server uses
the same sources format as an annotation server.

Here is an example of a simple sources document which makes no
distinction between the three sources categories.


Request:

http://www.example.com/das/genome/yeast.xml

Response:

Content-Type: application/x-das-sources+xml

<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00"
           xml:base="http://www.example.com/das/genome/">

   <SOURCE id="yeast.xml" title="Saccharomyces cerevisiae (Baker's  
yeast) genome"
          doc_href="http://www.example.com/yeast.html">
     <VERSION id="yeast.xml" created="2005-12-05">
       <COORDINATES taxid="4932" source="Gene_ID" authority="SGD32" />
       <CAPABILITY type="features" query_id="features.xml" />
       <CAPABILITY type="types" query_id="types.xml"/>
     </VERSION>
   </SOURCE>

</SOURCES>

All identifiers and href attributes in DAS documents follow the XML
Base specification (see http://www.w3.org/TR/xmlbase/ ) in resolving
partial identifiers and href attributes.  In this case the id
"yeast.xml" is fully resolved to
"http://www.example.com/das/genome/yeast.xml".


Here is an example of a more complicated sources document with
multiple organisms each with multiple versions.  Each of the two
source documents (one for each organism) has a distinct URL as does
each of the version for each organism.  This is a pure registry server
because the actual annotation data comes from other machines.

Request:
   http://www.biodas.org/known_servers

Response:

Content-Type: application/x-das-sources+xml

<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00">
   <SOURCE id="http://das.ensembl.org/das/SPICEDS/"  
title="das_vega_trans">
     <VERSION id="http://das.ensembl.org/das/SPICEDS/127/"  
created="2005-05-23">
       <MAINTAINER email="someone at sanger.ac.uk" />
       <COORDINATES taxid="7955" source="Chromosome" authority="ZV4"
                    test_range="BX255914" />
       <CAPABILITY types="segments"
               
query_id="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62">
       <CAPABILITY type="features"
            query_id="http://das.ensembl.org/das/SPICEDS/127/features" />
         <SUPPORTS name="das2queries" />
       </CAPABILITY>
       <CAPABILITY type="types"
            query_id="http://das.ensembl.org/das/SPICEDS/127/types" />
     </VERSION>

     <VERSION id="http://das.ensembl.org/das/SPICEDS/128/"  
created="2005-08-13">
       <MAINTAINER email="someone-else at sanger.ac.uk" />
       <COORDINATES taxid="7955" source="Chromosome" authority="ZV4"
                    test_range="BX255914" />
       <CAPABILITY type="segments"
               
query_id="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62">
       <CAPABILITY type="features"
            query_id="http://das.ensembl.org/das/SPICEDS/128/features" />
         <SUPPORTS name="das2queries" />
       </CAPABILITY>
       <CAPABILITY type="types"
            query_id="http://das.ensembl.org/das/SPICEDS/128/types" />
       <CAPABILITY type="locks"  
url="http://das.ensembl.org/das/SPICEDS/128/locks" />
       <CAPABILITY type="writeback"
                 url="http://das.ensembl.org/das/SPICEDS/128/locks" />
     </VERSION>
   </SOURCE>

   <SOURCE id="http://www.example.com/das2/mus/sources.xml" title="Mus  
musculus">
     <VERSION id="http://www.example.com/das2/mus/42/sources.xml"  
created="2006-02-11">
       <MAINTAINER email="pied-piper at hamlet.ac.uk" />
       <COORDINATES taxid="10090" source="Clone" authority="Ensembl"
                 test_range="AL935121" />
       <CAPABILITY type="features"
             
query_id="http://www.example.com/cgi-bin/features-mus-v42.cgi">
         <SUPPORTS name="das2queries" />
       </CAPABILITY>
       <CAPABILITY type="types"
            query_id="http://www.example.com/das2/mus/v42/types.xml" />
     </VERSION>
   </SOURCE>
</SOURCES>

Each SOURCE id and VERSION id is individually fetchable so the URL
"http://das.ensembl.org/das/SPICEDS/" returns a sources document with
the SOURCE record for "das_vega_trans" and both of its VERSION
subelements while "http://das.ensembl.org/das/SPICEDS/128/" returns a
sources document with only the second of its VERSION subelements.

DAS documents refer to other documents through URLs.  There are no
restrictions on the internal form of the URLs, other than the query
string portion.  Server implementers are free to choose URLs which
best fit the architecture needs.  For example, a simple DAS server may
be implemented as a set of XML files hosted by a standard web server
while more complex servers with search support may be implemented as
CGI scripts or through embedded web server extensions.  The URLs do
not need to define a hierarchical structure nor even be on the same
machine. Compare this to the DAS1 specification where some URLs were
constructed by direct string modification of other URLs.

=====

Each versioned source contains a set of segments. A segment is the
largest chunk of contiguous sequence. For fully sequenced organisms a
segment may be a chromosome.  For partially assembled genomes where
the distance between the assembled regions is not known then each
region may be its own segment.  If a server provides annotations in
contig space then each contig is a segment.  Feature locations are
specified on ranges of segments which is why a specific set of
segments is called a coordinate system.  [coordinate-system] This
specification does not describe how to do alignments between different
coordinate systems.


The sources document format has two ways to describe the coordinate
system.  The optional COORDINATES element uniquely characterize the
coordinate system.  If two data sources have the same authority and
source values then they must be annotations on the same coordinate
system.  The specific coordinate system is also called the "reference
sequence".

A versioned source may contain CAPABILITY elements which describe
different ways to request additional data from a DAS server.  Each
CAPABILITY has a type that describes how to use the corresponding URL
to query a DAS server.  A CAPABILITY element of type "segments" has a
query URL which returns a document of content-type
"application/x-das-segments+xml".  A segments document lists
information about the segments in the coordinate system.  Here is an
example of a segments document.

Request:

http://www.biodas.org/das2/h.sapiens/v3/segments.xml

Response:

Content-Type: application/x-das-segments+xml

<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://www.biodas.org/ns/das/genome/2.00">
  <SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr1.xml"
      name="Chr1" length="245522847"
      doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=1"/>
  <SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr2.xml"
      name="Chr2" length="243018229"
      doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=2"/>
</SEGMENTS>

=====

The versioned source record for an annotation server must include a
CAPABILITY of type "features".  A client may use the query URL from
the features CAPABILTY points to select features which match certain
criteria.  If no criteria are specified the server must return all
features unless there are too many features to return.  In that case
it must respond with an error message.

Unless an alternate format is specified, the response from the
features query is a document of content-type
"application/x-das-features+xml" containing all of the matching
features.  Here is an example features document for a server which
contains a gene and an alignment.

Request:

http://das.biopackages.net/das/genome/yeast/S228C/features.pl

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
           xml:base="http://www.example.org/volvox/1/">
  <FEATURE id="feature/cTel54X" type_id="type/gene" name="tg-3">
    <LOC segment="Chr2/1200:2917:1" />
  </FEATURE>

  <FEATURE id="feature/hit12"
           type_id="type/est-alignment"
           created="2001-12-15T22:43:36"
           modified="2004-09-26T21:10:15" >

    <LOC segment="Chr3/1201:1400:1" />
    <PART id="feature/hit12.hsp1" />
    <PART id="feature/hit12.hsp2" />
    <ALIGN target_id="feature/yk12391" range="200:299" />
    <PROP key="est2genomescore" value="180" />
  </FEATURE>

  <FEATURE id="feature/hit12.hsp1"
           type_id="type/est-alignment-hsp">
    <LOC segment="Chr3/1201:1250:-1" />
    <PARENT id="feature/hit12"/>
    <ALIGN target_id="feature/yk12391" range="1:52" gap="M49 D1 M1"/>
    <PROP  key="est2genomescore" value="180" />
  </FEATURE>

  <FEATURE id="feature/hit12.hsp2"
           type_id="type/est-alignment-hsp" >
    <LOC segment="Chr3/1351:1400:1" />
    <PARENT id="feature/hit12" />
    <ALIGN target_id="feature/yk12391" range="53:100" gap="M20 D1 G1  
M30" />
    <PROP  key="est2genomescore" value="120" />
  </FEATURE>

</FEATURES>

Each feature has a unique identifier and an identifer linking it to a
type record.  Both identifiers are URLs and should be directly
fetchable.  Simple features can be located on a region of a segment.
More complex features like a gapped alignment are represented through
a parent/part relationship.  A feature may have multiple parents and
multiple parts.

=====

An annotation server may contain many features while the client may
only be interested in a subset; most likely features in a given
portion of the reference sequence.  To help minimize the bandwidth
overhead the feature query URL should support the DAS feature filter
language.  The syntax uses the standard HTML form-urlencoded GET query
syntax.  For example, here is a request for all features on Chr2.

Request:

http://www.example.org/volvox/1/features.cgi?inside=Chr2

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
           xml:base="http://www.example.org/volvox/1/">
  <FEATURE id="feature/cTel54X" type_id="type/gene" name="tg-3">
    <LOC segment="Chr2/1200:2917:1" />
  </FEATURE>

  <FEATURE id="feature/hit12"
           type_id="type/est-alignment"
           created="2001-12-15T22:43:36"
           modified="2004-09-26T21:10:15" >

    <LOC segment="Chr3/1201:1400:1" />
    <PART id="feature/hit12.hsp1" />
    <PART id="feature/hit12.hsp2" />
    <ALIGN target_id="feature/yk12391" range="200:299" />
    <PROP key="est2genomescore" value="180" />
  </FEATURE>
</FEATURES>

and here is the rather long one for all EST alignments

Request:

http://www.example.org/volvox/1/features.cgi? 
type=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Ftype%2Fest-alignment

Response:

Content-Type: application/x-das-features+xml

<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
           xml:base="http://www.example.org/volvox/1/">
  <FEATURE id="feature/hit12"
           type_id="type/est-alignment"
           created="2001-12-15T22:43:36"
           modified="2004-09-26T21:10:15" >

    <LOC segment="Chr3/1201:1400:1" />
    <PART id="feature/hit12.hsp1" />
    <PART id="feature/hit12.hsp2" />
    <ALIGN target_id="feature/yk12391" range="200:299" />
    <PROP key="est2genomescore" value="180" />
  </FEATURE>
</FEATURES>

=====

All features are linked to a type record.  DAS types do not describe a
formal type system in that DAS types do not derive from other DAS
types.  Instead it links to an external ontology term and describes
how to depict features of that type.

A DAS annotation server must contain a CAPABILITY element of type
"types".  A client may use its query URL to fetch a document of
content-type "application/x-das-types+xml". The document lists all of
the types available on the server.  We expect that servers will have
at most a few dozen types so DAS does not support type filters.

The following is a hypothetical example of a DAS annotation server
providing GENSCAN gene predictions for zebrafish.  Each feature is
either of type
"http://www.example.org/das/zebrafish/build19/high-type" or
"http://www.example.org/das/zebrafish/build19/low-type" depending on
if the data provider determined it was a high probability or low
probability prediction.  Even though there are two different type
records the refer to the same ontology term, in this case the SO term
for "gene".  The distinction exists so that the high probability
features are depicted differently from the low probability features.

Request:

http://www.example.org/das/zebrafish/build19/types

Response:

Content-Type: application/x-das-types+xml

<TYPES xmlns="http://www.biodas.org/ns/das/genome/2.00"
        xml:base="http://www.example.org/das/zebrafish/build19/">
   <TYPE id="high-type" title="High probability gene predictions"
        
doc_href="http://www.example.org/docs/genscan_prediction.html#high"
       source="GENSCAN 1.0"
        
ontology="http://song.sourceforge.net/XXX/does/not/exist/SO/0000704"
       accession="SO:0000704"
     <STYLE>
       <BOX fgcolor="red" border_width="1"/>
     </STYLE>
   </TYPE>
   <TYPE id="low-type" title="Low probability gene predictions"
       doc_href="http://www.example.org/docs/genscan_prediction.html#low"
       source="GENSCAN 1.0"
        
ontology="http://song.sourceforge.net/XXX/does/not/exist/SO/0000704"
       accession="SO:0000704"
     <STYLE>
       <BOX fgcolor="yellow" border_width="1"/>
     </STYLE>
   </TYPE>

</TYPES>


[coordinate-system]

We make a distinction between "coordinate system" and "numbering
system".  The coordinate system is the set of segment on which
features are located.  The numbering system describes how to identify
the specific residues in the segment.  DAS uses a 0-based coordinate
system where the first residue is numbered "0", the second "1", and so
on.  Other numbering systems include 1-based coordinates and the PDB
numbering system which preserves the residue number for key residues
across homologous family by allowing discontinuities, insertions and
negative values as position numbers.


					Andrew
					dalke at dalkescientific.com


From nomi at fruitfly.org  Mon Mar  6 03:09:22 2006
From: nomi at fruitfly.org (Nomi Harris)
Date: Mon, 6 Mar 2006 00:09:22 -0800 (PST)
Subject: [DAS2] DAS/2 teleconference?
Message-ID: <17419.60978.358549.246997@kinked.lbl.gov>

Is there a DAS/2 teleconference tomorrow morning?  Last week it didn't
happen.
        Nomi


From dalke at dalkescientific.com  Mon Mar  6 04:14:30 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 6 Mar 2006 02:14:30 -0700
Subject: [DAS2] DAS/2 teleconference?
In-Reply-To: <17419.60978.358549.246997@kinked.lbl.gov>
References: <17419.60978.358549.246997@kinked.lbl.gov>
Message-ID: <fd0f45acc7110c7d18f8c8a9f7fbe39d@dalkescientific.com>

Nomi:
> Is there a DAS/2 teleconference tomorrow morning?  Last week it didn't
> happen.

I plan on calling in.


					Andrew
					dalke at dalkescientific.com


From Gregg_Helt at affymetrix.com  Mon Mar  6 09:03:24 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 6 Mar 2006 06:03:24 -0800
Subject: [DAS2] DAS/2 teleconference?
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA0A@msex02.affymetrix.com>

Apologies for the mixup with the teleconference last week!  Yes we're
definitely on for a teleconference today at the standard time, 9:30 AM
Pacific time.

	Thanks,
	Gregg

> -----Original Message-----
> From: das2-bounces at portal.open-bio.org
[mailto:das2-bounces at portal.open-
> bio.org] On Behalf Of Nomi Harris
> Sent: Monday, March 06, 2006 12:09 AM
> To: DAS/2
> Subject: [DAS2] DAS/2 teleconference?
> 
> Is there a DAS/2 teleconference tomorrow morning?  Last week it didn't
> happen.
>         Nomi
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2


From lstein at cshl.edu  Mon Mar  6 09:49:18 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 6 Mar 2006 09:49:18 -0500
Subject: [DAS2] DAS/2 teleconference?
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA0A@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA0A@msex02.affymetrix.com>
Message-ID: <200603060949.19299.lstein@cshl.edu>

Hi Gregg,

I'll miss the first half hour of the call today because of an overlap with an 
NCI teleconference.

Lincoln

On Monday 06 March 2006 09:03, Helt,Gregg wrote:
> Apologies for the mixup with the teleconference last week!  Yes we're
> definitely on for a teleconference today at the standard time, 9:30 AM
> Pacific time.
>
> 	Thanks,
> 	Gregg
>
> > -----Original Message-----
> > From: das2-bounces at portal.open-bio.org
>
> [mailto:das2-bounces at portal.open-
>
> > bio.org] On Behalf Of Nomi Harris
> > Sent: Monday, March 06, 2006 12:09 AM
> > To: DAS/2
> > Subject: [DAS2] DAS/2 teleconference?
> >
> > Is there a DAS/2 teleconference tomorrow morning?  Last week it didn't
> > happen.
> >         Nomi
> >
> > _______________________________________________
> > DAS2 mailing list
> > DAS2 at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/das2
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu


From Gregg_Helt at affymetrix.com  Mon Mar  6 11:44:43 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 6 Mar 2006 08:44:43 -0800
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>

upcoming Code Sprint, March 13-17 at Affymetrix
status reports
 
coordinate system resolution via COORDINATES element
features with multiple locations vs. alignments
features with multiple parents
???
 

From lstein at cshl.edu  Mon Mar  6 12:37:39 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 6 Mar 2006 12:37:39 -0500
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
Message-ID: <200603061237.41288.lstein@cshl.edu>

Hi,

The teleconference system now asks me for a passcode. Previously I just had to 
enter the conference ID. What's up?

Lincoln

On Monday 06 March 2006 11:44, Helt,Gregg wrote:
> upcoming Code Sprint, March 13-17 at Affymetrix
> status reports
>
> coordinate system resolution via COORDINATES element
> features with multiple locations vs. alignments
> features with multiple parents
> ???
>
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu


From Gregg_Helt at affymetrix.com  Mon Mar  6 12:38:37 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 6 Mar 2006 09:38:37 -0800
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA0D@msex02.affymetrix.com>

Please try again, it shouldn't ask for a passcode, but if it does, it's
1365.  There may be some glitch in our teleconferencing...

	Thanks,
	Gregg

> -----Original Message-----
> From: Brian O'Connor [mailto:boconnor at ucla.edu]
> Sent: Monday, March 06, 2006 9:36 AM
> To: Helt,Gregg
> Cc: das2 at portal.open-bio.org
> Subject: Re: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
> 
> Hi Gregg,
> 
> I tried calling in to the DAS conference call but it asked for a
> passcode in addition to the conference ID.  All I have is the
conference
> ID...
> 
> --Brian
> 
> Helt,Gregg wrote:
> 
> >upcoming Code Sprint, March 13-17 at Affymetrix
> >status reports
> >
> >coordinate system resolution via COORDINATES element
> >features with multiple locations vs. alignments
> >features with multiple parents
> >???
> >
> >
> >_______________________________________________
> >DAS2 mailing list
> >DAS2 at portal.open-bio.org
> >http://portal.open-bio.org/mailman/listinfo/das2
> >
> >
> >


From nomi at fruitfly.org  Mon Mar  6 12:40:26 2006
From: nomi at fruitfly.org (Nomi Harris)
Date: Mon, 6 Mar 2006 09:40:26 -0800
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
Message-ID: <17420.29706.575212.913804@spongecake.lbl.gov>

i am calling in (800-531-3250, id: 2879055) but it is then asking me for
a passcode.  i tried entering 2879055 again but that didn't work.  we
didn't used to have a passcode, did we?  can someone tell me what it is?
if you prefer not to email it, you can phone me at 510 486-5078.
     Nomi


From Gregg_Helt at affymetrix.com  Mon Mar  6 13:10:23 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 6 Mar 2006 10:10:23 -0800
Subject: [DAS2] Examples of features with multiple locations from
	biopackages server
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA10@msex02.affymetrix.com>

In the teleconference today, we?re talking about features with multiple locations, here?s an example from biopackages server:
 
<FEATURE id="feature/Affymetrix_YG-S98:3589_f_at" type="type/SO:PCR_product" name="Affymetrix_YG-S98:3589_f_at">
<LOC id="segment/chrVII" range="114942:536087:1"/>
<LOC id="segment/chrVII" range="540819:876266:1"/>
<LOC id="segment/chrVII" range="561866:562768:-1"/>
<LOC id="segment/chrVII" range="567445:569664:-1"/>
<LOC id="segment/chrVII" range="567445:818311:-1"/>
<LOC id="segment/chrVII" range="816491:932008:1"/>
</FEATURE>
?
            <FEATURE id="feature/Affymetrix_YG-S98:3750_f_at" type="type/SO:PCR_product" name="Affymetrix_YG-S98:3750_f_at">
<LOC id="segment/chrVII" range="327915:402221:1"/>
<LOC id="segment/chrVII" range="561919:562068:-1"/>
<LOC id="segment/chrVII" range="811564:811720:1"/>
<LOC id="segment/chrVII" range="823052:823197:-1"/>
</FEATURE>


From boconnor at ucla.edu  Mon Mar  6 12:36:28 2006
From: boconnor at ucla.edu (Brian O'Connor)
Date: Mon, 06 Mar 2006 09:36:28 -0800
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
Message-ID: <440C731C.5070303@ucla.edu>

Hi Gregg,

I tried calling in to the DAS conference call but it asked for a 
passcode in addition to the conference ID.  All I have is the conference 
ID...

--Brian

Helt,Gregg wrote:

>upcoming Code Sprint, March 13-17 at Affymetrix
>status reports
> 
>coordinate system resolution via COORDINATES element
>features with multiple locations vs. alignments
>features with multiple parents
>???
> 
>
>_______________________________________________
>DAS2 mailing list
>DAS2 at portal.open-bio.org
>http://portal.open-bio.org/mailman/listinfo/das2
>
>  
>


From dalke at dalkescientific.com  Mon Mar 13 09:00:45 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 06:00:45 -0800
Subject: [DAS2] format information for the reference server
Message-ID: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com>

(NOTE: the open-bio mailing lists were moved from portal.open-bio.org
to lists.open-bio.org.  My first email on this bounced because I
sent to the old email address.)

Summary of questions:
   - what does it mean for the annotation server to list the formats
       available from the reference server?
   - can the reference server format information be moved to the
       segments document?
   - are there formats which will only work at the segment level and
       not at the segments level (ie, formats which don't handle multiple
       records)?

Something's been bothering me about the segments request.

Currently the DAS sources request responds with something like

<SOURCES>
   <SOURCE>
    <VERSION>
      <CAPABILITY type="segments" query_url="http://blah/seq">
         <FORMAT name="fasta" />
         <FORMAT name="agp" />
      </CAPABILITY>
   ...
</SOURCES>

This says "go to 'blah' for information about the sequence".

But it says more than that.  It provides metadata about
the reference server.  It says that the reference server can
respond in 'fasta' and 'agp' formats.

Hence the following are allowed from this URL

   http://blah/seq?format=agp  -- return the assembly
   http://blah/seq?format=fasta -- return all sequences in FASTA format

Does this mean that all annotations servers using the given
reference server must list all of the available formats?

If a client sees multiple CAPABILITY elements for the same
query_url is it okay to merge the list of supported formats?
That is, if server X says that annotation server A supports
fasta and server Y says that A supports genbank then a client
may assume A supports both fasta and genbank formats?
(This makes sense to me.)

Second, does it make sense to require the annotation servers
to list the formats on the reference server?  What about
making that information available from the segments document,
like this.

query:

   http://www.biodas.org/das/h.sapiens/38/segments.cgi

response:

<SEGMENTS>
   <SEGMENT id="abc">
     <FORMAT name="fasta" />
     <FORMAT name="agp" />
   </SEGMENT>
   <SEGMENT id="def">
     <FORMAT name="fasta" />
     <FORMAT name="agp" />
   </SEGMENT>
</SEGMENT>

A problem with this the lack of data saying that the
segments query URL itself supports multiple formats.  For
example,

   http://www.biodas.org/das/h.sapiens/38/segments.cgi?format=fasta

might support returning all of the chromosomes in FASTA format.

Are there any formats which only work at the segment level
and not at the segments level?  That is, which only work with
single gene/chromosome/contig/etc. but don't support multiple
sequences?  The only one I could think of off-hand is "raw",
since there's no concept of a "record" given a bunch of letters,
unless the usual way is to separate them by an extra newline?

If all formats are supported for both single and all segments
then here is another possible response

[possibility #1]
<SEGMENTS>
   <FORMAT name="fasta" />
   <FORMAT name="agp" />
   <SEGMENT id="abc" />
   <SEGMENT id="def" />
</SEGMENT>

I think all formats which work on the "segments" level also
work on a single segment level, so another possibility is
the following, which lets a given segment say that it supports
more formats.

[possibility #2]
<SEGMENTS>
   <FORMAT name="fasta" />
   <FORMAT name="agp" />
   <SEGMENT id="abc">
     <FORMAT name="raw" />
   </SEGMENT>
   <SEGMENT id="def" />
     <FORMAT name="raw" />
   </SEGMENT>
</SEGMENT>


Here's another, using a flag to say if a format is for a
single segment, the segments URL, or both (feel free to
pick better names!). By default it applies to both.

[possibility #3]

<SEGMENTS>
   <!-- both support FASTA retrieval -->
   <FORMAT name="fasta" />

   <!-- both support GenBank retrieval -->
   <FORMAT name="genbank" applies-to="both" />

   <!-- can only get the assembly of everything -->
   <FORMAT name="agp" applies-to="segments" />

   <!-- can only get the raw sequence for a segment -->
   <FORMAT name="raw" applies-to="segment" />
</SEGMENT>

Yet another option is

[possibility #4]
<SEGMENTS>
   <FORMATS-FOR-SEGMENTS>
     <FORMAT name="fasta" />
     <FORMAT name="genbank" />
     <FORMAT name="agp" />
   </FORMATS-FOR-SEGMENTS/>
   <FORMATS-FOR-SINGLE-SEGMENT>
     <FORMAT name="fasta" />
     <FORMAT name="genbank" />
     <FORMAT name="raw" />
   </FORMATS-FOR-SEGMENTS/>
   ..

Of these I support [possibility #1], with the ability to go
to [possibility #3] if there's ever a case where a given format
cannot be applied to both levels.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Mon Mar 13 09:29:28 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 06:29:28 -0800
Subject: [DAS2] id, url, uri, and iri
Message-ID: <cd2e10567561e030affb493ad1d55aba@dalkescientific.com>

Something to settle.

I've been using 'id' like this

>  <FEATURE id = "feature/hit12"
>           type_id = "type/est-alignment"
>           created = "2001-12-15T22:43:36"
>           modified = "2004-09-26T21:10:15" >
>
>    <LOC id="residues/Chr3" range="1201:1400:1" />
>    <PART id="feature/hit12.hsp1" />
>    <PART id="feature/hit12.hsp2" />
>    <ALIGN target_id="feature/yk12391" range="200:299" />

As Dave Howorth pointed out, most people use 'id' as an
in-document identifier, and not as an identifier to link
to other documents.  Eg, there's a "getElementById()" method
in the DOM which is mean to find DOM nodes given the id.

In looking around I found that it's keyed off of the type
(as determined by the schema) and not by the string 'id'.
I added 'xml:id' as a possible DAS attribute, which is defined
by the XML spec to work as expected for getElementById.

In private email Gregg asked about using 'uri' instead of
'id' for this.

I'm now leaning that way.  Either 'uri' or 'url' or 'iri'.
I prefer url because everyone knows what that means.  Gregg
prefers 'uri' I think because that's what allows fragment
identifiers, and because it includes things which are other
than URLs, like LSIDs.

However, the latest thing these days is an "iri" which
means "internationalized resource identifier"
   http://www.ietf.org/rfc/rfc3987.txt

I haven't read enough of it to understand it.  My first
attempt says that it's okay to use "uri" because there
are 1-to-1 mappings between uris and iris.  Also, I don't
want to test bidirectional text and I suspect there isn't yet
widely used library support for iris.

So I want to change the DAS use of 'id' to 'url' and say
"the value of the 'url' attribute is a URI".


					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Mon Mar 13 10:38:58 2006
From: Steve_Chervitz at affymetrix.com (Chervitz, Steve)
Date: Mon, 13 Mar 2006 07:38:58 -0800
Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 6 Mar 2006
Message-ID: <C03AD212.1CEBE%Steve_Chervitz@affymetrix.com>

[These are notes from last week's meeting. -Steve]

Notes from the weekly DAS/2 teleconference, 6 Mar 2006

$Id: das2-teleconf-2006-03-06.txt,v 1.1 2006/03/13 15:41:03 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  CSHL: Lincoln Stein
  Sanger: Thomas Down
  Dalke Scientific: Andrew Dalke
  UC Berkeley: Nomi Harris
  UCLA: Brian O'Connor
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


Agenda:
-------

upcoming Code Sprint, March 13-17 at Affymetrix
status reports
coordinate system resolution via COORDINATES element
features with multiple locations vs. alignments
features with multiple parents
???

[ Some trouble with passcode for teleconf - hopefully fixed ]

TD: The coord syst things we were hoping to discuss with Andreas who
won't make it today.
GH: We can push this off till next week.

Code Sprint
-------------
LS: At sanger mon-tues for ensembl sab meeting, able to participate
from tues pm to fri eve.
AD: Planning to come to Affy
BO: Allen and I are planning to come up to Emeryville
GH: For payment, submit expenses to affy.
Hotels? Marriott or Woodfin. Will send out rec's today.
NH: Planning to attend at affy mon-tues, thur.

[A] Ed will look into accts for andrew and brian (internet access)

GH: Plan on 9-10am phone teleconf daily. Greg can pick up people from
hotel. 

GH: Goals/deliverables for this code sprint?
LS: Write das/2 client for bioperl. Plan to plug into Gbrowse
All I need is a working server
AD: Writing writeback and locks, improving validator .
NH: Apollo and registry, feature types. Wrote a writer, can test in
AD's validator (plan to).
GH: Keep working on das/2 client for igb at affy. Hoping by then to
have an affy das/2 server up and running.
SC: Can help get it up
GH: Can we put on in our dmz, so it's publically accessible at least
for the code sprint.

[A] Steve will look into setting up publically accessible affy das/2 test
server 

TD: Working on getting an Ensembl das/2 server up.
GH: Java middle ware on top of biojava?
TD: Yes. Using the biojava to ensembl bridges.
EE: Getting IGB to use style sheets.
AD: And/or using a proper style sheet system, if you decide what I put
in there is not good enough.
BO: Looking for something to do. Hoping to start on writeback.,
Helping separate out igb model layer. Finished rpm packages in last
code sprint, this is pretty much done.
GH: Guess Allen will be working on the biopackages server.
BO: Waiting on spec for writeback.
AD: My writeup specifies how they do writeback at Sanger, overlaps
well with Lincoln's proposal. See that.

GH: Need to tighten up the read-only spec. A fair number of things to
resolve.
AD: A partial draft of 3rd version. Planning to update it before next
sprint. Examples so people can get a feel for how things go together.

GH: My agenda stuff: coord system resolution system to match
annotations on same genome coming from diff servers.

[A] Gregg will wait for Andreas to join in before discussing coordinate
issues.

GH: Feats w/ multiple locations (see email Gregg sent to the list
today with examples). Current spec says if you use >1 coord
system, you can have feats with multiple locations. Is this what we
want to say? 
GH: Allen's server has feats w/ >1 location on same coord system. Do
we want to allow or disallow? If disallow, how?
AD: Possible usecase for alignments.
GH: Feat model for bioperl. Locations have multiple parts. Feats with
mult locations feels similar to that. Do you have multple children
each with a loc, or do you use the align element?
LS: Prefers children. That's what SO ended up doing after much
arguing. Makes it easier.
GH: Enforce it with the ontology. E.g, an alignment hit has alignment
hsps. 
This forces client to understand the ontology.
LS: Consider that an hsp will have scores attached to it, different
cigar line. So you  end up with mult children anyway. An improverished
type of alignment. Can use cigar line to indicate mismatches. Can have
a single HSP and a cigar line to indicate gaps. Only one child. You
don't have to have multiple locations
GH: Looking for use case of multiple locations with PCR products...
My main concern is how much semantic knowledge the clients need to
understand these things. Nothing in the spec that restricts mult
locations.
AD: Won't client just get the multiple children and not care about
types?
GH: I gues a simple client could do that. It disturbs me that it's up
the server how to handle multip loc, childrent, vs aligmnets.
Will send an example.
LS: Yes. this is a vague area. There should be a best-practices
section in the spec.
Single match feature from begin to end. HSP children, each one covers
major gaps. Cigar line w/in hsp to cover minor gaps. Can give each hsp
an alignment score.
GH: Main diff between locn and alignment is cigar string, and cigar
string is optional.
If we're allowed to use locations to designate alignments...
LS: How about if we consolidate location and alignment: location has
an optional cigar and then do away with alignment. Generalize
location to allow for gaps.
TD: Example: Aligning an est to the genome. Falls into several blocks
of exact/near exact matching. If location has cigar line, could serve
it up as a single feature.
GH: You can do this since cigar can represent arbitrary length gaps.
TD: Neat and compact way to do it. Does this scare anyone?
GH: Sounds reasonable.
AD: Let's do it. And will put in examples of best practices.

[A] Consolidate location and alignment in spec, loc has optional cigar

GH: Feats with mult parents. Need examples to test. This is a question
to people putting up servers. Will anyone have these?
TD: Ensembl might do this. Exon shared between several transcripts. A
toss up between multiple parents vs. multiple copies of same
exon. Think mult parents is the way to do it.
LS: Flybase use multiple parents for exons in this way.
TD: Ensemble db is a many-to-many between transcripts and exons.

GH: Spec says: If you have a child in the feat document, you have to
include its parent; If you have a parent you must include it's
children. As long as this plays policy nice with that requirement, I'm
ok with it. 
GH: Anyone else see things that need to be ironed out in spec?
AD: Not yet

NH: We should write a paper about das/2. This will help get more
people using it, increase the success of the spec.
GH: Agreed -- good idea. We have lots of text in grant about the
philosophy of das/2.
NH: Can pull text from these places. Publish at a conference perhaps?
ISMB, CSB2006
GH: PLoS Bioinformatics?
NH: Conference would be nice, to involve people in discussion.
AD: Poster session is available for ISMB.
NH: Prefers a conference talk. Paper will require more finished
stable. Poster is too much work for little payoff.
AD: Ann L complains that the only paper to cite for das is an old
ref. Wants an updatable citable paper.
NH: CSB will publish a proceedings.
Genome informatics at CSHL (they don't publish though).
NH/GH: What's the best conference to get published in these days?
LS: ISMB
NH: We missed deadline for it.
LS: Biocurators meeting?
NH: Can ask Sima about. Another one: Computational Genomics (TIGR
sponsored). Not published. Submit abstracts, they select
talks. Halloween in Baltimore. If conf proceedings are published, you
can't submit to a paper, so we could go that way, get double mileage
out of it.
GH: Sounds good to get something ready for a paper rather than a
conference. Did a presentation at Bosc, Genome informatics last year.

[A] Nomi will help get paper ready for PLoS (after code sprint)

AD: Can do poster for ismb, bosc in Brazil, if I end up going.
NH: ISMB deadline is 10 May, so we should get going on it

GH: Continuation grant submission, in theory has been reviewed, but
haven't heard back. Maybe will take another month, to get score back.
Final word?
LS: Have you checked ERA Commons? They may update it there before you
get the note.


From dalke at dalkescientific.com  Mon Mar 13 10:58:29 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 07:58:29 -0800
Subject: [DAS2] definition of coordinate system attributes
In-Reply-To: <e4fac3477606826ca6dbc80a11555d0a@sanger.ac.uk>
References: <3124ef2656aa51af817f16b1b71b16a2@sanger.ac.uk>
	<e5f47011c9258aecb1e4d21291de4aac@dalkescientific.com>
	<b4ce0b58feae2b49a77191059da213c9@sanger.ac.uk>
	<c52c87ec4f95cd9186d59ba463e5b76e@dalkescientific.com>
	<e4fac3477606826ca6dbc80a11555d0a@sanger.ac.uk>
Message-ID: <e5ea1776e51e938f6dc72ea26ee37f26@dalkescientific.com>

I've been exchanging emails with Andreas

>> Me?  I don't know what it's for.  Which means I've wiped it.
>
> is this a spec change? then I need to update the source response form 
> the new devel  dasregistry ...
>
> actually the new_spec.txt says it has not been changed since feb. 
> 10th...

I had hoped to have an updated spec by now.  (After all, the conf.
call is in an hour.)  That didn't happen.  :(

I've attached what I have so far.  I'll be working on it more today,
and getting things in CVS updated.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: draft3.txt
URL: <http://lists.open-bio.org/pipermail/das2/attachments/20060313/9e1e7b22/attachment.txt>
-------------- next part --------------


					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Mon Mar 13 11:47:32 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Mon, 13 Mar 2006 16:47:32 +0000
Subject: [DAS2] format information for the reference server
In-Reply-To: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com>
References: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com>
Message-ID: <e36fa2d7ad8152a2122357a01a3f4e03@sanger.ac.uk>


On 13 Mar 2006, at 14:00, Andrew Dalke wrote:

> Summary of questions:
>    - what does it mean for the annotation server to list the formats
>        available from the reference server?

should this happen? I thought that annotation servers are described by 
their "coordinate system"
then the registry provides a list of available references servers that 
provide the sequences for this.

> Something's been bothering me about the segments request.
>
> Currently the DAS sources request responds with something like
>
> <SOURCES>
>    <SOURCE>
>     <VERSION>
>       <CAPABILITY type="segments" query_url="http://blah/seq">
>          <FORMAT name="fasta" />
>          <FORMAT name="agp" />
>       </CAPABILITY>
>    ...
> </SOURCES>
>
> This says "go to 'blah' for information about the sequence".
>
> But it says more than that.  It provides metadata about
> the reference server.  It says that the reference server can
> respond in 'fasta' and 'agp' formats.

I think an annotation server should not know/provide this information
this should come from the reference server / registry


> If a client sees multiple CAPABILITY elements for the same
> query_url is it okay to merge the list of supported formats?

that does not sound clean.

> That is, if server X says that annotation server A supports
> fasta and server Y says that A supports genbank then a client
> may assume A supports both fasta and genbank formats?
> (This makes sense to me.)

the client should ask the reference server directly what it speaks /
rely on the registration server to have validated that server A speaks
indeed what  it says it does.

Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From Gregg_Helt at affymetrix.com  Mon Mar 13 12:13:14 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 13 Mar 2006 09:13:14 -0800
Subject: [DAS2] DAS/2 code sprint conference starting now
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA2A@msex02.affymetrix.com>

We just started the daily DAS/2 code sprint teleconference at
Affymetrix.
US number #: 800-531-3250
International #: 303-928-2693
Conference ID: 2879055
Passcode: 1365
 
 
From Gregg_Helt at affymetrix.com  Mon Mar 13 15:48:50 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 13 Mar 2006 12:48:50 -0800
Subject: [DAS2] Problem with name feature filter on biopackages server
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>

I'm looking into adding the ability in the IGB DAS/2 client to retrieve
features by name/id.  Trying this out with the biopackages server almost
gives me what I want:
 
http://das.biopackages.net/das/genome/yeast/S228C/feature?name=YGL076C
 
except that in the returned XML the parent feature (YGL076C) does not
list it's children as <PARTS>, though the children list YGL076C as
<PARENT>.  Any ideas?
 
            thanks!
            gregg
 

From nomi at fruitfly.org  Mon Mar 13 17:32:49 2006
From: nomi at fruitfly.org (Nomi Harris)
Date: Mon, 13 Mar 2006 14:32:49 -0800 (PST)
Subject: [DAS2] Where to publish [was Re: Notes from the weekly DAS/2
	teleconference, 6 Mar 2006]
In-Reply-To: <C03AD212.1CEBE%Steve_Chervitz@affymetrix.com>
References: <C03AD212.1CEBE%Steve_Chervitz@affymetrix.com>
Message-ID: <17429.62225.230884.764469@kinked.lbl.gov>

On 13 March 2006, Chervitz, Steve wrote:
 > NH/GH: What's the best conference to get published in these days?
 > LS: ISMB
 > NH: We missed deadline for it.
 > LS: Biocurators meeting?
 > NH: Can ask Sima about.

Sima said:
> Next biocurator meeting is probably in early 2007 in the UK. No plans at
> the moment to publish the proceedings, however.
> 
> I think publishing soon in PLoS is a good idea.


From dalke at dalkescientific.com  Mon Mar 13 18:45:04 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 15:45:04 -0800
Subject: [DAS2] URIs for sequence identifiers
Message-ID: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>

Proposals:
   - do not use segment "name" as an identifier
       - rename it "title" (human readable only)
       - allow a new optional "alias-of" attribute which is the
            link to the primary identifier for this segment

   - change the feature location to use the segment uri

   - change the feature filter range searches so there is a new "segment"
      keyword and so the "includes", "overlaps", etc. only work on
      the given segment, as
         segment=<uri>
         inside=$start:$stop
         overlaps=$start:$stop
         contains=$start:$stop
         identical=$start:$stop

   - If 'includes', 'overlaps', etc. are given then the 'segment'
       must be given (do we need this restriction?  It doesn't make
        sense to me to ask for "annotations on 1000 to 2000 of anything"

   - only allow at most one each of includes, overlaps,
       contains, or identical (do we need this restriction?)

   - multiple segments may be given, but then range searches
       are not supported (do we need this restriction?)

Discussion:

The discussion on this side of things was based on today's phone
conference.  Andreas needs data sources to work on multiple
coordinate spaces.

To quote from Andreas:
> There are several servers that understand more than one coordinate
> system and can return the same type of data in different coordinates.  
> (depending on which type of accession code/range was used for the
> request ) E.g. there are a couple of zebrafish servers that speak
> both  in Chromosome and Scaffold coordinates. (reason perhaps
> being that zebrafish is an organism that seems to be very difficult
> to assemble ?)

The current DAS system does not support this because of how
it does segment identifiers.

The current scheme looks like this:

<!-- sources.xml -->
<SOURCES ...>
   <SOURCE ...>
    <VERSION ...>
      <COORDINATES authority="Andreas" source="Scaffold" ... />
      <COORDINATES authority="Andreas" source="Chromosome" ... />
      <CAPABILITY type="segments" query_id="http://sanger/andreas/" />
      ....

Problem #1: We need two entry points, one to view the segments
in Scaffold space, the other to view them in Chromosome space.

Solution #1 (don't like it though).
Add a "source=" attribute to the CAPABILITY and allow multiple
segments capabilities

<!-- sources.xml -->
<SOURCES ...>
  <SOURCE ...>
   <VERSION ...>
    <COORDINATES authority="Andreas" source="Scaffold" ... />
    <COORDINATES authority="Andreas" source="Chromosome" ... />
    <CAPABILITY type="segments"
       query_id="http://sanger/andreas/scaffolds.xml" source="Scaffold"  
/>
    <CAPABILITY type="segments"
       query_id="http://sanger/andreas/chromosomes.xml"  
source="Chromosome" />
     ....


I don't like it because it feels like the COORDINATES and
CAPABILITY[type="segments"] field should be merged.  Still, I'll
go with it for now.

Problem #2: feature searches return features from either namespace

Consider search for name=*ABC* (that is, "ABC" as a substring in
the "name" or "alias" fields).  Then the result might be

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment="A/100:200" />
   </FEATURE>
</FEATURES>

Where "A" is a short-hand notation for one of the segments?
Which one?  The client goes to the segment servers:

Query: http://sanger/andreas/scaffolds.xml"
Response:
<SEGMENTS>
  <SEGMENT id="http://whatever.com/ChromosomeA" name="A" length="2000" />
</SEGMENTS>

Query: http://sanger/andreas/chromosomes.xml"
<SEGMENTS>
  <SEGMENT id="http://whatever.com/ScaffoldA" name="A" length="2000" />
</SEGMENTS>

The segment name "A" matches either ChromosomeA or ScaffoldA, and
there's no way to figure out which is correct!


This comes because our own naming scheme is not very good at
being globally unique.  We could fix it by also stating the
namespace in the result, as

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment="A/100:200" source="Scaffold"/>
   </FEATURE>
</FEATURES>

Gregg asked "why don't we just use the URI"?

After a long discussion we decided to propose just that.
That is, get rid of the "name" attribute.  Instead, use a
"title" attribute which is human readable and an optional
"alias-of" which contains is the primary identifier for
the given segment.

The alias-of value is determined by the person who
defined the COORDINATES.  It could be a URL.  It could
a URI.  It does not need to be resolvable (though it
should - perhaps to a human readable document?  Or to
something which lists all known aliases to it?)

That is, the segments document will look like this

Query: http://sanger/andreas/scaffolds.xml"
Response:
<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ChromosomeA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Chromosome/A"
     title="Chromosome A" />
</SEGMENTS>

Query: http://sanger/andreas/chromosomes.xml"
<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ScaffoldA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Scaffold/A"
     title="Scaffold A" />
</SEGMENTS>

This has a few implications.  Feature locations must be given
with respect to the segment uri, as

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment_uri="http://whatever.com/ScaffoldA" range="200:300"/>
   </FEATURE>
</FEATURES>

Given this segment_uri a client can figure out if it is in
Scaffold or Chromosome space because it can check all of the
URIs in each space for a match.


The other change is in range searches.  Consider the current
scheme, which looks like

   includes=ChrA
   includes=A/100:300

The query is of the form $ID or $ID/$start:$end.  It needs to be
changed to support URLs.  For examples,

   includes={http://www.whatever.com/ChromosomeA
   includes={http://www.whatever.com/ScaffoldA}/100:300

We couldn't come up with a better syntax.  Then Gregg asked
"why do we need multiple includes"?

That is, the current syntax supports
   includes=ChrA/0:1000;includes=ChrB/2000:3000;includes=ChrC/5000:6000

to mean "anywhere on the first 1000 bases of ChrA, the 3rd 1000
bases of ChrB, or the 6th 1000 bases of ChrC".

Given the query language, we're looking for way to write that
using URLs, as

    
includes={http://www.whatever.com/ChromosomeA}0:1000;includes={http:// 
www.whatever.com/ChromosomeB}:2000:3000;includes={http:// 
www.whatever.com/ChromosomeC}:5000:6000;

However, that's a very unlikely query.  What if we split the
"includes", "overlaps", etc. into "includes_segment" and  
"includes_range".
In that case:

   old-style:
includes=A/500:600
   new-style:
includes_segment=http://www.whatever.com/ChromosomeA; 
includes_range=500:600

   old-style:
includes=A/500:600,Chr3/700:800
   new-style:
includes_segment=http://www.whatever.com/ChromosomeA; 
includes_range=500:600;
includes_range=700:800

   old-style:
includes=A/500:600,D/700:800
   new-style: -- NOT POSSIBLE

   old-style:
includes=A/500:600,D/500:600
   new-style: (not likely to be used in real life)
includes_segment=http://www.whatever.com/ChromosomeA; 
includes_segment=http://www.whatever.com/ChromosomeD; 
includes_range=500:600;

This no longer allows searches with subranges from different segments.

The again -- who cares?  Those sorts of searches are strange.

Talking some more.  Who needs the ability to do more than one
includes / overlaps / etc. query at a time?  Gregg wants the
ability to do a combination of includes and overlaps, but
that's all.

We can simplify the server code by only supporting one
inside search, one contains search, and/or one overlaps
search, instead of the current system which allows a more
constructive geometry, and we can move the segment id out
into its own parameter.

Allen said that that would prevent more complicated types
of analysis on the server, but that anyone doing more
complicated searches would pull the data down locally.

Does anyone want to do more than one overlaps search at
at time?  More than one contains search at a time?  More
than one identical search at a time?

(For that matter, does anyone actually want to do a "identical"
search?  Gregg thinks it will be useful to find any other
annotations which are exactly matching the given range.
I think that might be better with a "include"/"exclude" combination
to have start/end positions within a couple of bases from
the specified range.)

PROPOSAL:
   Change the range query language to have

segment=  <<the url of the segment to search>
inside= $start:$end
overlaps= $start:$end
contains= $start:$end

Example:

segment=http://whatever.com/ChromosomeD;inside=5000:6000

Also, only allow at most one includes, one overlaps, and
one contains (unless people want it).  I'm less sure about
the need for this restriction.  It might be as easy to
implement the more complex search as it would be to check
for the error cases.


					Andrew
					dalke at dalkescientific.com


From ed_erwin at affymetrix.com  Mon Mar 13 18:56:56 2006
From: ed_erwin at affymetrix.com (Ed Erwin)
Date: Mon, 13 Mar 2006 15:56:56 -0800
Subject: [DAS2] URIs for sequence identifiers
In-Reply-To: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>
References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>
Message-ID: <441606C8.3070902@affymetrix.com>


Andrew Dalke wrote:

>>There are several servers that understand more than one coordinate
>>system and can return the same type of data in different coordinates.  
>>(depending on which type of accession code/range was used for the
>>request ) E.g. there are a couple of zebrafish servers that speak
>>both  in Chromosome and Scaffold coordinates. (reason perhaps
>>being that zebrafish is an organism that seems to be very difficult
>>to assemble ?)
> 
> 
> The current DAS system does not support this because of how
> it does segment identifiers.
> 
> 
> Problem #1: We need two entry points, one to view the segments
> in Scaffold space, the other to view them in Chromosome space.
> 
> Solution #1 (don't like it though).
> Add a "source=" attribute to the CAPABILITY and allow multiple
> segments capabilities


> Problem #2: feature searches return features from either namespace
> 

A different solution:

Scaffold and Chromosome coordinate systems are served by separate DAS/2 
servers.  Each server returns data from one and only one namespace.

Those separate servers can, behind-the-scenes, use the same database.

DAS/2 clients, like IGB, would choose to connect to either the 
Scaffold-based server or the Chromosome-based server, but not usually to 
both at once.

Does this handle all the issues?

Ed


From dalke at dalkescientific.com  Mon Mar 13 19:12:52 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 16:12:52 -0800
Subject: [DAS2] URIs for sequence identifiers
In-Reply-To: <441606C8.3070902@affymetrix.com>
References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>
	<441606C8.3070902@affymetrix.com>
Message-ID: <54829d8554d9b044908965d80b158c60@dalkescientific.com>

Ed:
>> Problem #2: feature searches return features from either namespace
>
> A different solution:
>
> Scaffold and Chromosome coordinate systems are served by separate 
> DAS/2 servers.  Each server returns data from one and only one 
> namespace.
>
> Those separate servers can, behind-the-scenes, use the same database.
>
> DAS/2 clients, like IGB, would choose to connect to either the 
> Scaffold-based server or the Chromosome-based server, but not usually 
> to both at once.
>
> Does this handle all the issues?

Here's the email I got from Andreas when I proposed this.


>>> There may be more than one COORDINATE element if ... (XXX why?)
>
> There are several servers that understand more than one coordinate 
> system and
> can return the same type of data in different coordinates. (depending 
> on which type of accession code/range was used for the request )
> E.g. there are a couple of zebrafish servers that speak both  in 
> Chromosome and Scaffold coordinates.
> (reason perhaps  being that zebrafish is an organism that seems to be 
> very difficult to assemble ?)


>> Will there be separate CAPABILITY items for each source?
>
> no. if there are then this should be registered as two independent 
> servers.

(but see clarification later)

> Allowing multiple coordinate systems per server is a way to slightly 
> reduce the already long list of known
> servers. Currently there are about 90 in the registry (+10 in the last 
> few weeks...) and there still are about 20 more
>  which have not been registered (and are provided by the BioSapiens 
> project).

>> Long for who?  It isn't that much data.
>
> It is long for somebody who browses manually through the ensembl DAS 
> configuration and searches for a DAS source to add to.
> It a "long" list for myself to read through the DAS server list at
> http://das.sanger.ac.uk/registry/listServices.jsp
> and although I know this list pretty well,  it seems to me a lot of 
> text/descriptions, etc.


>> There is only one reference server for an annotation server.
>
> I think it should be one reference server per coordinate system.


>> But if there are two COORDINATES elements, and you say that
>> each has its own reference server, then aren't you saying that
>> a single annotation server may have multiple reference servers?
>
> yes. i believe that this should be possible.

>>  What's the concern about having
>> no more than one coordinate per data source?
>
> Just last friday somebody asked me how to add a DAS server that has 
> two coordinate systems to different Ensembl views ( ContigView and 
> GeneView)
> Her initial solution was to provide multiple DAS sources
> http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_211
> and
> http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_219
>
> but I think this could be joint into a single server.


In any case, I think the proposal I outlined in the previous email
makes things cleaner even without support for multiple coordinate
systems on the same server.

					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Mon Mar 13 23:22:36 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Mon, 13 Mar 2006 20:22:36 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day one, 13 Mar 2006
Message-ID: <C03B850C.1CF48%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day one, 13 Mar 2006

$Id: das2-teleconf-2006-03-13.txt,v 1.1 2006/03/14 04:31:36 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Andreas Prlic
  Dalke Scientific: Andrew Dalke (at Affy)
  UC Berkeley: Nomi Harris (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


General note: 
Passcode is now required to enter teleconf.
This is a change in their system.


Issue: Continuation Grant
-------------------------

gh: no word yet.


Issue: Coordinate System
------------------------

ad: question of what happens when there are multiple coordinate
systems for an assembly.

auth and source,
source: contig space, scaffold space
auth: organization (e.g. ncbi, ucsc)

gh: not enough to get uniqueness.
ncbi, genome, human is not enough, need version
to uniquely id the coord system

ad: auth, source, species, version identification string
gh: use case: need to know whether uris for two versioned source
refer to the same genome.

gh: ncbi version numbers are separate from organism info, eg. v35.

ad: we could have a service for mapping strings

gh: idea - every server can say this assembly name is same as
that. Clients could chain together statements from multiple servers.
For the affy das server used by igb, we now have a synonyms file on
our server which igb reads. It's a pain to maintain.

ad: type of alignment server?
gh: a synonym server. Here's a uri, give me a list of synonyms that
refer to the same thing.

This is something tho talk more about when Andreas is on line.

[Andreas joins in.]

GH: How would a das server verify the version info in a sources
document point to same genome assembly?
AP: You would check auth=ncbi, vers=35, taxid=human
AP: In protein structure space, you check verison on every object you
work with. Protein seq.
gh: so we have to map version info on sequences as well as genome
assemblies. 
gh: use case: two segment responses from diff servers, diff uris for
the diff sequences, how you know they are refering to the same seq.
name=chromosome21 vs name=chr21?
ad: we require the same name for the same segments.
gh: going to fall apart fast. no way to enforce it. People use 1, I,
chr1, chromI.
ee: can put this in the validation suite.
aday: yes.
gh: but what do you use for name: accession # for entry, string chr1,
etc.
gh: important since this is the name that goes to user.
ad: could have one slot for computer to use, one for human consumption.
ad: for segments there seem to be two diff ids: url,
ad: the point of having special ids for segments is segment
equivalence from different servers. Separate coordinates element that
says how to merge things together. Identifiers in here that are just
coordinate space ids, not necessarily for human use. Only for identifying
coords.
gh: but how do we get people to use it?

sc: what about the idea of using checksums as identifiers for a seq?
ad: problem of duplicate seqs in an assembly. eg., same seq from chr1
and chr9.
gh: if they are the same seq they should get the same id.
ad: don't you want to know if there is a region on chr1 that is an
exact duplicate of a region on chr9?
sc: we could create the checksum on source:sequence

gh: useful to have a central place to ask for diff names for the same
coord system.
ad: uniqueness idea: coords element, has: auth, source, version,
species (optional) 
uniqueness says these are the names you use.
gh: this can fail. What do we say happens when it fails? Should there
be a way of resolving it.
ad: this is where your synonym table comes it. Publish it?
gh: maybe as part of the registry, knows

ap: there isn't a big variety in naming because there aren't many
people providing assemblies.
gh: we already have 10 different synonyms for an assembly
ee: this has some performance impact on igb. should have to do it.
ap: we should say this is how naming works.
gh: will fail.

ad: is this required for this version of the spec?
gh: need something that can be used now.
aday: without hardwiring
gh: if we don't agree during the code sprint, then it won't happen for
everyone else.
aday: using roman numerals for yeast since sgd uses it.
ee: trouble with chrX

ad: andreas: is there a place for naming of segments to use
ap: no, something for the reference server, not coords
ad: given these coords, here are the names that are used.
ap: same as reference server.

gh: maybe registry should provide: here's a coord system and here are
the names you can use for
ap: you would get a long list for proteins
aday: a user who wants to

gh: question for brian g: LSID, when you come across this for LSIDs,
ncbi is auth for human genome assembly yet they have no lsid for their
assembly, how do people refer to their lsid when there's no authority
to say what it is?
bg: you can't, no one is the authority. but you can write a resolver
that queries ncbi under the cover, in your resolver you make ncbi the
authority of the lsid, add namespace, object id. Then everyone has to
know that your resolver is hosted at some site somewhere. So there is
no satisfactory answer. It's a problem if the authority does not host
the resolver.
bg: I'm at the w3c meeting at mit, providing a webified resolver, they
would host a resolver, everyone would know to go to a well-known web
address. 

bg: you start a convention, enforce it, give error if people don't
use it.
gh: thinking we need it associated with registry.
ap: ref server + coord system, provides ids that can be used,
gh: so other ids can be used, but registry server wouldn't support it.

ad: site has ftp site for downloading chromosomes, contains names for
different segments in the file. How do I go from the ids in ths file
to the ids that Andreas describes.
To make my annotations in the same space. Mapping from file from ncbi.
bg: what are your use cases? write back to server?
ad: user publishing locally,
bg: you make a ref server.
gh: experience from das1 is that everyone makes their own reference
server and refers to it from their annotation server, using different
names. 
ad: new tag 'coordinates'
gh: like enforcing common names at registry server. Can use their own
names, they just won't be allowed to post on the registry.

ad: need documentation
ap: could point to docn on reference server

bg: workflow1: fish researcher looking for abberant regions in chr7,
11 and 3, singled out the abctransporter gene. How does that work in
das/2? type 'abc' in web page for reference server? This is a gene name.
ad: your client browser can go to to registry to find servers that
host the assemblies for your fish. Go to those reference servers, do
searches there. Will go to coord system, get a segments document, get
display chromosome by title.
gh: get a das features xml document saying the sequence and
coordinates.
gh: our discussion here is on getting the diff.
ad: we don't have anything on coordinates saying which is the latest
version.

bg: latest build may have changed their gene coordinate.
gh: mapping servers is part of our continuation grant. Can push an
annotation on one assembly to another assembly.
bg: a hard thing.
gh: that's why where enlisting UCSC to do it!

ad: Topic: id, url, uri, iri (see email)
gh: likes uri, not url. Some things aren't really urls
(resolvable). Iri might work.

ad: multiple coord elements for same ref server.
ap: originally there was one, but some use two, zebrafish guy chrom
and scaffold coordinates. or chromosomes vs. gene ids. same types,
different accession codes and features.
ad: if you have graphical browser, do you get scaffolds or
chromosomes.
ap: depends on your view.
gh: if you do a segments query, do you get segments and contigs?
ap: depending on the coordinate system of the requrest.
ad: one capabilities for scaffolds and one for chromosomes?
gh: maybe

Deliverables:
[A] gregg: by end of week, load stuff from multiple servers, compare in the
same view.

[A] steve will work on getting gregg's das/2 server up and running.

gh: trouble with biopackages.net server
aday: possible power outage interference.

gh: target filters have been  dropped.
aday: yay!


From dalke at dalkescientific.com  Tue Mar 14 10:14:44 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 07:14:44 -0800
Subject: [DAS2] use cases
Message-ID: <8bc46502eb164882394a3f4acbe08987@dalkescientific.com>

I think these cover the basic use cases.  Let me know if there
are other reasonable ones I should add.

Use Case #1

Biologist viewing genomic region wants to add information
from server www.biodas.org/das2/ .

Example of use:
   - Go to "open DAS server" option.  Type/paste URL for
DAS server.
   + DAS viewer connects to server, verifies that it
annotates the same sequence source and has under (say)
10 types so it makes a new track for each type and does
a request for all the features in the current display.


Use Case #2

Biologist wants all lac repressors on build 12 of mouse.

Example of use:
   - Start DAS viewer.  Go to "find server" option.  Select
"mouse" from the list of "model organisms".  Select "build 12"
from a pull-down menu of build descriptions.  Select all
the listed servers.
   - Go to "find annotations" option

Now what?  Is "lac repressor" a name?  Is it a combination
of a name and ontology term?  Is it a pure ontology term?


Use Case #3

Biologist wants to find all the annotation servers for the most
recent build of H. sapiens.

Example of use:
  - Start DAS viewer.  Go to "find server" option.  Type "human"
(or "H. sapiens" or "Homo sapiens").  Search.
  + DAS viewer consults internal NCBI taxonomy table to get taxid.
DAS viewer displays all matches.
  - Sort by build date, select all matching servers by hand

Problem:
   DAS has no field to search by build date


Use Case #4

Bioinformaticist wants to make annotations available for
build v32 of human.

Example of use:
   - Go to registry server to get a human-readable description
of the COORDINATES fields for build v32.
   - decide to point people to a reference server instead of
providing local sequence data
   - create the sources, types and features document
   - put them on a web server
   - go to registry and submit site for future inclusion


Use Case #5

IT wants people to use local mirrors of reference
server when possible.

Example of use:
   - set up a local registry server
   + server connects to Andreas' registry server and downloads
all the data
   + server rewrites "segments" sections to use local server
   - configure all DAS viewers to consult local registry server


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Tue Mar 14 10:13:44 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 07:13:44 -0800
Subject: [DAS2] using 'uri' instead of 'id'
Message-ID: <9779f55861a4e800d0d21ec8d96deb8c@dalkescientific.com>

Okay, I'm convinced.

Where things in the spec use 'id' they will now use 'uri'.

There are going to be a few wide-spread but shallow
changes because of this.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Tue Mar 14 11:09:12 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 08:09:12 -0800
Subject: [DAS2] segments and coordinates
Message-ID: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com>

Summary:  I want to
   - move the COORDINATE element inside of the
         CAPABILITY[type="segments"] element

   - add a 'created' timestamp to the COORDINATE (for sorting by time)

   - add a unique 'uri' identifier attribute to the COORDINATE
      (two coordinates are equal if and only if they have the same id)

   - have that identifier be resolvable, to get information about
       the coordinate system (but perhaps leave the contents for a
       future spec)

In writing the documentation I've been struggling with
COORDINATES.  No surprise there.

The current spec has COORDINATES and the "segments" capability
as different elements, like

<COORDINATES source="Chromosome" authority="NCBI" version="v22"
        taxid="9606" created="2006-03-14T07:27:49" />
<CAPABILITY type="segments"
     query_id="http://localhost/das2/h.sapiens/v22/segments" />

(Note the 'created' timestamp to sort a list of coordinates
by the time it was established.)

With the current discussion on multiple coordinates, it
looks like there is a 1-to-1 relationship between a COORDIANTES
record and a CAPABILITY record.  As that's the case I want
to merge them together, as in (note change from "_id" to "_uri")


<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES source="Chromosome" authority="NCBI" version="v22"
          taxid="9606" created="2006-03-14T07:27:49" />
</CAPABILITY>

In talking with Andreas I think he agrees that this makes sense.


Second, there's a question of identity.  When are two coordinates
the same?  Is it when they have the same
   (authority, source, version)
the same
   (authority, source, version, taxid)

Since taxid is optional, what if one server leaves it out;
are the two still the same?

I decided to solve it with a unique identifier.  Two
COORDINATES are the same if and only if they have the
same identifier.  That identifier just happens to be
a URI.  It does not need to be resolvable (but should
be, with the results viewable at least for humans).

Let's say that
   http://das.sanger.ac.uk/registry/coordinates/ABC123
is the identifier for:
   authority=NCBI
   version=v22
   taxid=9606
   source=Chromosome
   created=2006-03-14T07:27:49

Then the following are equivalent.  The only difference is the
number of properties defined in the COORDINATES tag.

<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES 
uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" />
</CAPABILITY>


<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
       source="Chromosome"/>
</CAPABILITY>


<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
      source="Chromosome" authority="NCBI" version="v22" taxid="9606"
      created="2006-03-14T07:27:49" />
</CAPABILITY>


In theory these extra values don't need to be in the COORDINATES
tag.  They are knowable given the uri.  But that requires a
discovery mechanism for the properties (eg, the COORDINATES identifier
might need to be retrievable, with some format or other).

There is the possibility of value mismatch, but as Andreas pointed
out the registry server can do that validation pretty easily.


I mentioned property discovery earlier.  Given a coordinates URI
there are three things you might want to know:
   - what is the full list of coordinate system properties?
   - what is the authoritative reference server for the coordinates?
   - are there alternate reference servers?

What if that was resolvable (doesn't need to be defined for DAS,
so this is hypothetical) into something like

<COORDINATE-SYSTEM doc_href="something for humans to read">
   <!-- definitive information about this coordinate system -->
   <COORDINATES uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
       source="Chromosome" authority="NCBI" version="v22" taxid="9606"
       created="2006-03-14T07:27:49" />
   <SEGMENT-SERVER uri="http://whatever/" is-authoritative="yes" />
   <SEGMENT-SERVER uri="http://mirror1/"/>
   <SEGMENT-SERVER uri="http://mirror2/"/>
   <SEGMENT-SERVER uri="http://mirror3/"/>
</COORDINATE-SYSTEM>

(Hmmm, those are some ugly names.  I usually shy away from '-'s
in element and attribute names.)


OR, what if the authoritative URL also implemented the segments
interface, and we added a COORDINATES element to it?  Errr, I
don't like that.  We will be in charge of the coordinate
system URIs but we won't be in charge of the primary reference
server.

Use Case #6.

NCBI releases a new human build.  Ensembl releases annotations
for it and wants to put the information with Andreas' registry.

Example of use:
   - Set up an Ensembl reference server and annotation server
        for the new build; test it out
   - Create a new coordinate system record on the registry
      - fill in the species, source, doc_href, etc. fields
      - when finished the result is a URL, tied to coordinate info
   - Stick the COORDINATES information in the versioned
        source record
   - Tell the registry server to register the given versioned
        source URL


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Tue Mar 14 11:21:54 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 08:21:54 -0800
Subject: [DAS2] today's sprint meeting
Message-ID: <b16e929488b3a77c0771fa9297984cff@dalkescientific.com>

Gregg can't make it this morning and asked that I let today's
meeting.  Here are the things I would like to talk about:

== segment identifier.

Quoting from my email yesterday

   - do not use segment "name" as an identifier
       - rename it "title" (human readable only)
       - allow a new optional "alias-of" attribute which is the
            link to the primary identifier for this segment

<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ChromosomeA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Chromosome/A"
     title="Chromosome A" />
</SEGMENTS>


   - change the feature location to use the segment uri

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment_uri="http://whatever.com/ScaffoldA" range="200:300"/>
   </FEATURE>
</FEATURES>

   - change the feature filter range searches so there is a new "segment"
      keyword and so the "includes", "overlaps", etc. only work on
      the given segment, as
         segment=<uri>
         inside=$start:$stop
         overlaps=$start:$stop
         contains=$start:$stop
         identical=$start:$stop

http://biodas.org/feature.cgi?segment=http://whatever.com/ChromosomeD; 
inside=5000:6000
(with URL escaping rules for the query string that's
       
...feature.cgi? 
segment=http%3A%2F%2Fwhatever.com%2FChromosomeD&inside=5000%3A6000

   - If 'includes', 'overlaps', etc. are given then the 'segment'
       must be given (do we need this restriction?  It doesn't make
        sense to me to ask for "annotations on 1000 to 2000 of anything"

   - only allow at most one each of includes, overlaps,
       contains, or identical (do we need this restriction?  Then again,  
Gregg
       only needs a single includes and a single overlaps; perhaps make  
this
       even more restrictive?)

   - multiple segments may be given, but then range searches
       are not supported (do we need this restriction?)

Consensus on this side seems to be fine.  The biggest worry is the
increasing use of URIs in URL query strings.


== coordinate systems

Quoting from an email I wrote recently

   - move the COORDINATE element inside of the
         CAPABILITY[type="segments"] element

   - add a 'created' timestamp to the COORDINATE (for sorting by time)

   - add a unique 'uri' identifier attribute to the COORDINATE
      (two coordinates are equal if and only if they have the same id)

Result looks like

<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
      source="Chromosome" authority="NCBI" version="v22" taxid="9606"
      created="2006-03-14T07:27:49" />
</CAPABILITY>


   - have that identifier be resolvable, to get information about
       the coordinate system (but perhaps leave the contents for a
       future spec)

== use 'uri' instead of 'id' in the spec

I've decided to go with 'uri' instead of 'id' (or 'url' or 'iri')
in its various uses in the spec.

== churn

My feeling is this is the last major churn.  I'm not able to keep
up with the documentation writing, which makes it hard for people
to get things done.

Should I work with people today on getting data sources working
and developing example data files for people to review?  That is,
examples which show and explain the various element in the spec?
I figure more people work from example than from spec description.

					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Tue Mar 14 11:35:07 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue, 14 Mar 2006 16:35:07 +0000
Subject: [DAS2] URIs for sequence identifiers
In-Reply-To: <441606C8.3070902@affymetrix.com>
References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>
	<441606C8.3070902@affymetrix.com>
Message-ID: <0cd005042c73d6080c568576a08bb987@sanger.ac.uk>

>
> A different solution:
>
> Scaffold and Chromosome coordinate systems are served by separate DAS/2
> servers.  Each server returns data from one and only one namespace.
>
> Those separate servers can, behind-the-scenes, use the same database.
>
> DAS/2 clients, like IGB, would choose to connect to either the
> Scaffold-based server or the Chromosome-based server, but not usually  
> to
> both at once.
>
> Does this handle all the issues?


Hm I see this as a possibility but what about the following:


<SOURCES>
<SOURCE id="DS2_1" title="yeastdas2server1"  
doc_href="http://cvs.biodas.org/cgi-bin/viewcvs/viewcvs.cgi/das/das2/ 
new_spec.txt?rev=1.6&cvsroot=biodas&content-type=text/vnd.viewcvs- 
markup">
?
	<VERSION id="latest" created="2006-02-08">
	<MAINTAINER email="allenday at ucla.edu"/>
?
	<COORDINATES taxid="4932" source="Chromosome" authority="SGD"  
test_range="chrVII/364251:366080">
	<VERSION name="32"/>
	</COORDINATES>
	<CAPABILITY type="segments"  
query_id="http://das.biopackages.net/das/genome/yeast/S228C/segment"/>
	<CAPABILITY type="types"  
query_id="http://das.biopackages.net/das/genome/yeast/S228C/type"/>
	</VERSION>
</SOURCE>

<SOURCE id="DS2_2" title="yeastdas2server2"  
doc_href="http://cvs.biodas.org/cgi-bin/viewcvs/viewcvs.cgi/das/das2/ 
new_spec.txt?rev=1.6&cvsroot=biodas&content-type=text/vnd.viewcvs- 
markup">
?
	<VERSION id="latest" created="2006-02-08">
	<MAINTAINER email="allenday at ucla.edu"/>
?
	<COORDINATES taxid="4932" source="Gene_ID" authority="SGD"  
test_range="ydr409w">
	<VERSION name="32"/>
	</COORDINATES>
	<CAPABILITY type="segments"  
query_id="http://das.biopackages.net/das/genome/yeast/S228C/segment"/>
	<CAPABILITY type="types"  
query_id="http://das.biopackages.net/das/genome/yeast/S228C/type"/>
	</VERSION>
</SOURCE>
</SOURCES>


This would be how to write one server which has two coordinate systems.  
according to the "one coord sys/server" rule.
I think it would be shorter to provide two coordinates sections for  
that and only one source description...


--- fyi, a yeast by Gene_ID server is  e.g.
http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_169


Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From ap3 at sanger.ac.uk  Tue Mar 14 11:48:09 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue, 14 Mar 2006 16:48:09 +0000
Subject: [DAS2] segments and coordinates
In-Reply-To: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com>
References: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com>
Message-ID: <ad94ac7cea5c8a07117d4785879b8a79@sanger.ac.uk>


On 14 Mar 2006, at 16:09, Andrew Dalke wrote:

> Summary:  I want to
>    - move the COORDINATE element inside of the
>          CAPABILITY[type="segments"] element

Is this really needed?

> The current spec has COORDINATES and the "segments" capability
> as different elements, like
>
> <COORDINATES source="Chromosome" authority="NCBI" version="v22"
>         taxid="9606" created="2006-03-14T07:27:49" />
> <CAPABILITY type="segments"
>      query_id="http://localhost/das2/h.sapiens/v22/segments" />


> With the current discussion on multiple coordinates, it
> looks like there is a 1-to-1 relationship between a COORDIANTES
> record and a CAPABILITY record.  As that's the case I want
> to merge them together, as in (note change from "_id" to "_uri")

I think hat this is a many to many relationship.
Do you still want to provide the link to the reference server from an 
annotation server?
This is not needed because the coordinates describe the reference 
server sufficiently.

Annotation servers do not need the segments capability - only the 
features capability.


> <CAPABILITY type="segments"
>       query_uri="http://localhost/das2/h.sapiens/v22/segments">
>    <COORDINATES source="Chromosome" authority="NCBI" version="v22"
>           taxid="9606" created="2006-03-14T07:27:49" />
> </CAPABILITY>
>
> In talking with Andreas I think he agrees that this makes sense.

If you really *want* to have the link back from the annotation server 
to the reference then
I would propose to put capability under coordinates - i.e. the other 
way round.


> econd, there's a question of identity.  When are two coordinates
> the same?  Is it when they have the same
>    (authority, source, version)
> the same
>    (authority, source, version, taxid)

yes

>
> Since taxid is optional, what if one server leaves it out;
> are the two still the same?

no - because if a taxid is specified that is a restriction for one 
organism. no taxid means that  this refers to multiple organisms.


> I decided to solve it with a unique identifier.

that might be good. this identifier could also be used to restrict 
searches on servers with many coordinate systems.

>
> Let's say that
>    http://das.sanger.ac.uk/registry/coordinates/ABC123
> is the identifier for:
>    authority=NCBI
>    version=v22
>    taxid=9606
>    source=Chromosome
>    created=2006-03-14T07:27:49

fine


> Then the following are equivalent.  The only difference is the
> number of properties defined in the COORDINATES tag.
>
> <CAPABILITY type="segments"
>       query_uri="http://localhost/das2/h.sapiens/v22/segments">
>    <COORDINATES
> uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" />
> </CAPABILITY>
>
>
> <CAPABILITY type="segments"
>       query_uri="http://localhost/das2/h.sapiens/v22/segments">
>    <COORDINATES 
> uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
>        source="Chromosome"/>
> </CAPABILITY>
>
>
> <CAPABILITY type="segments"
>       query_uri="http://localhost/das2/h.sapiens/v22/segments">
>    <COORDINATES 
> uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
>       source="Chromosome" authority="NCBI" version="v22" taxid="9606"
>       created="2006-03-14T07:27:49" />
> </CAPABILITY>

o.k.

This is a lot of change to the spec for us being already on the second 
code sprint,
but I think it makes things clearer

Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From dalke at dalkescientific.com  Tue Mar 14 15:46:27 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 12:46:27 -0800
Subject: [DAS2] description and title
Message-ID: <84c508c1625b5507dd511c8d1ef0f682@dalkescientific.com>

Andreas' DAS registry has a description for each versioned source.
See http://das.sanger.ac.uk/registry/listServices.jsp .

Here's an example of what's in it

     Machine learning approach used SWISSPROT variants annotated as
     disease/neutral as training dataset. Predictions made on all
     ENSEMBL nscSNPs as to their disease status

I've added an optional 'description' field to the versioned source
record for servers that wish to provide that information.


Allen's types response had 'name' and 'description' attributes.
These were not in the types record.  I've added 'description' and
added 'title'.

I've been using 'title' for short descriptions; a few words long.
I've been using 'description' for plain text up to a paragraph.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Tue Mar 14 19:34:55 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 16:34:55 -0800
Subject: [DAS2] updated examples
Message-ID: <dd454f0a2558990438f7b58e779406b0@dalkescientific.com>

Checked into das CVS.

das/das2/draft3/

The current (incomplete) spec is 'spec.txt'.  It is already out of date.
The .rnc files are up-to-date.

The subdirectory "ucla" contains data from Allen's server,
with the format hand-updated.

A couple of things to note.  I used three different ways of
specifying the same namespace:

<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00">

<DAS2:FEATURES xmlns:DAS2="http://www.biodas.org/ns/das/genome/2.00">

<das:SEGMENTS xmlns:das="http://www.biodas.org/ns/das/genome/2.00">


This is to check that you all are doing correct namespace processing.  
:)

Also, I've gone ahead and added the 'SUPPORTS' element, like this

       <CAPABILITY type="features" query_uri="yeast/features.xml">
         <SUPPORTS name="basic" />
       </CAPABILITY>

This says that the server only supports 'basic' searches, which means
you can only ask it for all the feature.  There is no feature query 
language.
There is also 'das2queries' which says that the server supports the
das2 query language.  The following says that you can ask for everything
or you can ask for things in the DAS2 query language.

       <CAPABILITY type="features" query_uri="yeast/features.xml">
         <SUPPORTS name="basic" />
         <SUPPORTS name="das2queries" />
       </CAPABILITY>

If not given the client should assume it supports 'das2queries'.
Note that 'basic' is a subset of 'das2queries'.


					Andrew
					dalke at dalkescientific.com


From lstein at cshl.edu  Wed Mar 15 05:46:41 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 10:46:41 +0000
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
Message-ID: <200603151046.43196.lstein@cshl.edu>

Hi Folks,

I just ran through the source request on biopackages.net and it is returning 
something that is very different from the current spec (CVS updated as of 
this morning UK time). I understand why there is a discrepancy, but for the 
purposes of the code sprint, should I code to what the spec says or to what 
biopackages.net returns? It is much more fun for me to code to a working 
server because I have the opportunity to watch my code run.

Best,

Lincoln

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From lstein at cshl.edu  Wed Mar 15 05:39:35 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 10:39:35 +0000
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
Message-ID: <200603151039.36405.lstein@cshl.edu>

Hi Folks,

Shouldn't the prefix to das2 requests be http://server/blahblah/das2   ?

It would make it easier for clients to load the correct parsing code and would 
avoid the client having to make a round-trip to the server just to determine 
whether it is dealing with a das/1 or das/2 server.

My apologies if this has already been discussed.

Lincoln

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From dalke at dalkescientific.com  Wed Mar 15 09:32:26 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 06:32:26 -0800
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <200603151046.43196.lstein@cshl.edu>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151046.43196.lstein@cshl.edu>
Message-ID: <4d86b8f899632c8cd506297938fffd8a@dalkescientific.com>

Lincoln:
> I just ran through the source request on biopackages.net and it is  
> returning
> something that is very different from the current spec (CVS updated as  
> of
> this morning UK time).

The server isn't synched with any specific version of the spec. For
example, if I make a features request from

http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/ 
0:1000")

I get

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DAS2FEATURE SYSTEM  
"http://www.biodas.org/dtd/das2feature.dtd">
<FEATURELIST
   xmlns="http://www.biodas.org/ns/das/2.00"
   xmlns:xlink="http://www.w3.org/1999/xlink"
   xml:base="http://das.biopackages.net/das/genome/yeast/S228C/feature">
</FEATURELIST>

As from the discussion a few weeks ago we shouldn't be using the
   standalone="no"
since that says the document cannot be understood without consulting
the DTD, which doesn't exist.  And I don't want a DTD.

Also, the namespace needs to be  
"http://www.biodas.org/ns/das/genome/2.00"
(It's missing the 'genome') and the 'FEATURELIST' was replaced with
'FEATURES' a year ago.

In the types request

<?xml version="1.0" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="/xsl/das-genome-type.xsl"?>
<!DOCTYPE DAS2TYPES SYSTEM "http://www.biodas.org/dtd/das2types.dtd">
<!--
      xmlns="http://www.biodas.org/ns/das/genome/2.00"
-->
<TYPES
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xml:base="http://das.biopackages.net/das/genome/yeast/S228C/type/">
   <TYPE id="SO:ARS" ontology="/das/ontology/obo/1/ontology/SO/0000436"  
name="ARS" definition="A sequence that can autonomously replicate, as a  
plasmid, when transformed into a bacterial host.">


the commented out namespace declaration needs to there, and the type
id 'SO:ARS' needs to be escaped as it's treated as an identifier  
resolved
with the "SO" protocol.  Plus, until yesterday I didn't know about the
'name' or 'definition' attributes.  These are now in the schema as
'title' and 'description'.

There are a few other differences, like problems in the taxid and
empty strings for timestamps.  I hand-updated examples from Allen's
server yesterday, in cvs under das/das2/draft3/ucla .  I found some
of these during the update, though others I pointed out about a
year ago.

Allen doesn't want to update the server until the spec is stable,
for two reasons.  First, he doesn't like the churn of doing work only
to have to make more changes.  Second, you're not the only one who says

>  It is much more fun for me to code to a working
> server because I have the opportunity to watch my code run.

and Allen's setup doesn't have the ability to implement two versions
at the same time.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 09:46:39 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 06:46:39 -0800
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <200603151039.36405.lstein@cshl.edu>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
Message-ID: <d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>

> Shouldn't the prefix to das2 requests be http://server/blahblah/das2   
> ?
>
> It would make it easier for clients to load the correct parsing code 
> and would
> avoid the client having to make a round-trip to the server just to 
> determine
> whether it is dealing with a das/1 or das/2 server.

It doesn't need the round-trip.  It can look at the Content-Type to
figure that out.

Plus, few of the DAS1 servers follow the DAS1 naming scheme.  Here's
a list from Andreas' registry server.

genome.cbs.dtu.dk:9000/das/tmhmm/
genome.cbs.dtu.dk:9000/das/netoglyc/
das.ensembl.org/das/ens_sc1_ygpm/
atgc.lirmm.fr/cgi-bin/das/MethDB/
smart.embl.de/smart/das/smart/
supfam.org/SUPERFAMILY/cgi-bin/das/up/
mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/

All of them do have the substring '/das/' somewhere, but not
at the start/end of the string.

Now, the content-type might be "application/xml" and not sufficient
to disambiguate between the two documents, but in that case you can
dispatch based on the root element type.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 10:05:52 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:05:52 -0800
Subject: [DAS2] XML namespaces
Message-ID: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>

I mentioned this yesterday but am doing it again as its own email.
This is a quick tutorial on XML namespaces.

The DAS spec uses XML namespaces.  XML didn't start with namespaces.
They were added later.  Older parsers, like SAX 1.0, did not understand
namespaces.  Newer ones, like SAX 2.0, do.

By default a document does not have a namespace.  For example,

<person name="Andrew" />

has no namespace.

To declare a default namespace use the 'xmlns' attribute.  All
attributes which start 'xml' or are in the 'xml:' namespace are
reserved.

<person name="Andrew" xmlns="http://www.biodas.org/" />

This is the name 'person' in the namespace 'http://www.biodas.org/'.
The namespace is an opaque identifer.  It leverages URIs in part
because it's much easier to guarantee uniqueness.

The combination of (namespace, tag name) is unique.  The tag
name is also called the "local name".

That's to distinguish it from a "qualified name", also called
a "qname".  These look like

<abc:person name="Andrew" xmlns:abc="http://www.biodas.org/" />

This element has identical meaning to the previous element
using the default namespace.  It's qname is 'abc:person' but
the full name is the tuple of

    ("http://www.biodas.org/", "person")

For notational convenience this is sometimes written in Clark
notation, as
   {http://www.biodas.org}person

   Element                                     Clark notation
<person />                                      person
<person xmlns="" />                             {}person
                            ("empty namespace" is different than "no  
namespace")

<person xmlns="http://biodas.org/" />            
{http://biodas.org/}person
<das:person xmlns:das="http://biodas.org/" />    
{http://biodas.org/}person
<X:person xmlns:X="http://biodas.org/" />        
{http://biodas.org/}person

The prefix used doesn't matter.  Only the combination of
   (namespace, local name)
is important.  The Clark notation string captures that as a single  
string,
which is much easier when doing comparisons.

For example, if you try the dasypus verifier at
    
http://cgi.biodas.org:8080/verify?url=http://das.biopackages.net/das/ 
genome/yeast/S228C/feature?inside=chr1/0:1000&doctype=features

one of the output messages is

Expected element '{http://www.biodas.org/ns/das/genome/2.00}FEATURES'  
but
got '{http://www.biodas.org/ns/das/2.00}FEATURELIST' at byte 113, line  
3, column 2

This shows the Clark name for the elements, indicating that the root
element has a different namespace and local name from what Dasypus  
expects.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 10:15:40 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:15:40 -0800
Subject: [DAS2] xml namespaces
Message-ID: <fee5e61fc425e4257406e81fba814a66@dalkescientific.com>

related to the previous email.  The spec uses the namespace

    http://www.biodas.org/ns/das/genome/2.00

I propose using a smaller and simpler URL.

The content does not matter to XML processors.  The practice though
is to use a URI which is resolvable for more information about the
element. For example,

      xmlns:xlink="http://www.w3.org/1999/xlink"

Go to that and the response is

> This is an XML namespace defined in the XML Linking Language (XLink) 
> specification.
>
> For more information about XML, please refer to The Extensible Markup 
> Language (XML) 1.0 specification. For more information about XML 
> namespaces, please refer to the Namespaces in XML specification.


Similarly the XML namespace URI is
   http://www.w3.org/1999/xhtml
XSLT is
   http://www.w3.org/1999/XSL/Transform

FOAF is
   http://xmlns.com/foaf/0.1/
which points to the actual documentation

I like the last approach and propose that DAS2 use the namespace

      http://biodas.org/documents/das2/


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 10:22:14 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:22:14 -0800
Subject: [DAS2] xml namespaces
In-Reply-To: <fee5e61fc425e4257406e81fba814a66@dalkescientific.com>
References: <fee5e61fc425e4257406e81fba814a66@dalkescientific.com>
Message-ID: <b8e904b97e7bd99c3cecbbc80df85a43@dalkescientific.com>

Me:
> I propose using a smaller and simpler URL.
  ...
> I like the last approach and propose that DAS2 use the namespace
>
>      http://biodas.org/documents/das2/

But it's such a minor point that not changing it is fine with me.

On the other hand, Allen's server doesn't given the right namespace
and Gregg's client currently ignores the namespace, so there isn't
any extra work.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 10:29:56 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:29:56 -0800
Subject: [DAS2] search by segment id
Message-ID: <712b5b29c53161455f3d9d1b34768937@dalkescientific.com>

One thing I came up with yesterday when moving from local identifiers
to URIs for the segment names.  There are two possible identifiers
for a given segment

<das:SEGMENTS xmlns:das="http://www.biodas.org/ns/das/genome/2.00"
        xml:base="http://localhost/das2/">
  <das:SEGMENT uri="segment/chr1" title="Chromosome 1" length="246127941"
    synonym="http://dalkescientific.com/human35v1/chr1"
    doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=1" />


The local name is "http://localhost/das2/segment/chr1" while
the well-known global name (of which the local name is an alias) is
"http://dalkescientific.com/human35v1/chr1"

The global name can be anything.  It can be "urn:lsid:chr1" or
anything else.  It only needs to be unique across all identifiers.

Now, are range queries done with the local name or the global one?

That is,   
features?segment=http://localhost/das2/segment/chr1&range=100:200

      or  
features?segment=http://dalkescientific.com/human35v1/chr1&range=100: 
200
    ( or features?segment=urn:lsid:chr1&range=100:200 if that was the  
uri)


If it's the local name then the client must first query all servers
to get the mapping from global name to local name, and perform the
translation itself.

I propose that the client can query using the global name, and not
need to do the mapping to the local name.  In addition, a server
may support both names in the query, since by using URIs we guarantee
there are no accidental id collisions.

					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Wed Mar 15 10:34:06 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Wed, 15 Mar 2006 15:34:06 +0000
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
	<d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
Message-ID: <9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk>

>
> genome.cbs.dtu.dk:9000/das/tmhmm/
> genome.cbs.dtu.dk:9000/das/netoglyc/
> das.ensembl.org/das/ens_sc1_ygpm/
> atgc.lirmm.fr/cgi-bin/das/MethDB/
> smart.embl.de/smart/das/smart/
> supfam.org/SUPERFAMILY/cgi-bin/das/up/
> mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/

all these servers match to the DAS 1 spec which says that the second to 
last bit
is "das" and the last bit is the "data source name".
The registry contains a check for that.

Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From td2 at sanger.ac.uk  Wed Mar 15 10:16:25 2006
From: td2 at sanger.ac.uk (Thomas Down)
Date: Wed, 15 Mar 2006 15:16:25 +0000
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
	<d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
Message-ID: <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk>


On 15 Mar 2006, at 14:46, Andrew Dalke wrote:

> Plus, few of the DAS1 servers follow the DAS1 naming scheme.  Here's
> a list from Andreas' registry server.
>
> genome.cbs.dtu.dk:9000/das/tmhmm/
> genome.cbs.dtu.dk:9000/das/netoglyc/
> das.ensembl.org/das/ens_sc1_ygpm/
> atgc.lirmm.fr/cgi-bin/das/MethDB/
> smart.embl.de/smart/das/smart/
> supfam.org/SUPERFAMILY/cgi-bin/das/up/
> mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/

These all look fine to me -- but they're URLs for individual data  
sources, rather than complete server installations.  Remove the last  
element and you'll get a server URL (e.g. genome.cbs.dtu.dk:9000/ 
das/) which ends /das/ in all cases.

The registry records datasources, not server installations.  In  
general, I'm not sure a server installation is a terribly  
"interesting" object, since it's quite possible that one server  
installation will host many datasources with little or no semantic  
connection between them -- the only thing they have in common is that  
they're hosted at the same site.

          Thomas.


From lstein at cshl.edu  Wed Mar 15 10:41:46 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 15:41:46 +0000
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <4d86b8f899632c8cd506297938fffd8a@dalkescientific.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151046.43196.lstein@cshl.edu>
	<4d86b8f899632c8cd506297938fffd8a@dalkescientific.com>
Message-ID: <200603151541.47538.lstein@cshl.edu>

I'll use your hand-edited examples for testing.

Lincoln

On Wednesday 15 March 2006 14:32, Andrew Dalke wrote:
> Lincoln:
> > I just ran through the source request on biopackages.net and it is
> > returning
> > something that is very different from the current spec (CVS updated as
> > of
> > this morning UK time).
>
> The server isn't synched with any specific version of the spec. For
> example, if I make a features request from
>
> http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/
> 0:1000")
>
> I get
>
> <?xml version="1.0" standalone="no"?>
> <!DOCTYPE DAS2FEATURE SYSTEM
> "http://www.biodas.org/dtd/das2feature.dtd">
> <FEATURELIST
>    xmlns="http://www.biodas.org/ns/das/2.00"
>    xmlns:xlink="http://www.w3.org/1999/xlink"
>    xml:base="http://das.biopackages.net/das/genome/yeast/S228C/feature">
> </FEATURELIST>
>
> As from the discussion a few weeks ago we shouldn't be using the
>    standalone="no"
> since that says the document cannot be understood without consulting
> the DTD, which doesn't exist.  And I don't want a DTD.
>
> Also, the namespace needs to be
> "http://www.biodas.org/ns/das/genome/2.00"
> (It's missing the 'genome') and the 'FEATURELIST' was replaced with
> 'FEATURES' a year ago.
>
> In the types request
>
> <?xml version="1.0" standalone="no"?>
> <?xml-stylesheet type="text/xsl" href="/xsl/das-genome-type.xsl"?>
> <!DOCTYPE DAS2TYPES SYSTEM "http://www.biodas.org/dtd/das2types.dtd">
> <!--
>       xmlns="http://www.biodas.org/ns/das/genome/2.00"
> -->
> <TYPES
>       xmlns:xlink="http://www.w3.org/1999/xlink"
>       xml:base="http://das.biopackages.net/das/genome/yeast/S228C/type/">
>    <TYPE id="SO:ARS" ontology="/das/ontology/obo/1/ontology/SO/0000436"
> name="ARS" definition="A sequence that can autonomously replicate, as a
> plasmid, when transformed into a bacterial host.">
>
>
> the commented out namespace declaration needs to there, and the type
> id 'SO:ARS' needs to be escaped as it's treated as an identifier
> resolved
> with the "SO" protocol.  Plus, until yesterday I didn't know about the
> 'name' or 'definition' attributes.  These are now in the schema as
> 'title' and 'description'.
>
> There are a few other differences, like problems in the taxid and
> empty strings for timestamps.  I hand-updated examples from Allen's
> server yesterday, in cvs under das/das2/draft3/ucla .  I found some
> of these during the update, though others I pointed out about a
> year ago.
>
> Allen doesn't want to update the server until the spec is stable,
> for two reasons.  First, he doesn't like the churn of doing work only
> to have to make more changes.  Second, you're not the only one who says
>
> >  It is much more fun for me to code to a working
> > server because I have the opportunity to watch my code run.
>
> and Allen's setup doesn't have the ability to implement two versions
> at the same time.
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From lstein at cshl.edu  Wed Mar 15 10:49:40 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 15:49:40 +0000
Subject: [DAS2] XML namespaces
In-Reply-To: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>
References: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>
Message-ID: <200603151549.41773.lstein@cshl.edu>

I have just finished adding XML namespace support to the early-version Perl 
DAS2 client. BTW, if a namespace tag is reused in an inner scope with a 
different 

	<das:name xmlns:das="http://foo.bar/das" />
		<das:first>Andrew</das:first>
		<das:middle xmlns:das="http://addresses.com/address/2.0">K.</das:middle>
                <das:last>Dalke</das:last>
          </das:name>

I put middle into namespace http://addresses.com/address/2.0 and put first and 
last into namespace http://foo.bar.das.

This is the correct scoping behavior, right?

Lincoln

On Wednesday 15 March 2006 15:05, Andrew Dalke wrote:
> I mentioned this yesterday but am doing it again as its own email.
> This is a quick tutorial on XML namespaces.
>
> The DAS spec uses XML namespaces.  XML didn't start with namespaces.
> They were added later.  Older parsers, like SAX 1.0, did not understand
> namespaces.  Newer ones, like SAX 2.0, do.
>
> By default a document does not have a namespace.  For example,
>
> <person name="Andrew" />
>
> has no namespace.
>
> To declare a default namespace use the 'xmlns' attribute.  All
> attributes which start 'xml' or are in the 'xml:' namespace are
> reserved.
>
> <person name="Andrew" xmlns="http://www.biodas.org/" />
>
> This is the name 'person' in the namespace 'http://www.biodas.org/'.
> The namespace is an opaque identifer.  It leverages URIs in part
> because it's much easier to guarantee uniqueness.
>
> The combination of (namespace, tag name) is unique.  The tag
> name is also called the "local name".
>
> That's to distinguish it from a "qualified name", also called
> a "qname".  These look like
>
> <abc:person name="Andrew" xmlns:abc="http://www.biodas.org/" />
>
> This element has identical meaning to the previous element
> using the default namespace.  It's qname is 'abc:person' but
> the full name is the tuple of
>
>     ("http://www.biodas.org/", "person")
>
> For notational convenience this is sometimes written in Clark
> notation, as
>    {http://www.biodas.org}person
>
>    Element                                     Clark notation
> <person />                                      person
> <person xmlns="" />                             {}person
>                             ("empty namespace" is different than "no
> namespace")
>
> <person xmlns="http://biodas.org/" />
> {http://biodas.org/}person
> <das:person xmlns:das="http://biodas.org/" />
> {http://biodas.org/}person
> <X:person xmlns:X="http://biodas.org/" />
> {http://biodas.org/}person
>
> The prefix used doesn't matter.  Only the combination of
>    (namespace, local name)
> is important.  The Clark notation string captures that as a single
> string,
> which is much easier when doing comparisons.
>
> For example, if you try the dasypus verifier at
>
> http://cgi.biodas.org:8080/verify?url=http://das.biopackages.net/das/
> genome/yeast/S228C/feature?inside=chr1/0:1000&doctype=features
>
> one of the output messages is
>
> Expected element '{http://www.biodas.org/ns/das/genome/2.00}FEATURES'
> but
> got '{http://www.biodas.org/ns/das/2.00}FEATURELIST' at byte 113, line
> 3, column 2
>
> This shows the Clark name for the elements, indicating that the root
> element has a different namespace and local name from what Dasypus
> expects.
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From dalke at dalkescientific.com  Wed Mar 15 10:53:11 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:53:11 -0800
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
	<d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
	<9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk>
Message-ID: <0e5d03e0bc2f9ab791a891f058ca664b@dalkescientific.com>

Andreas (and Thomas)
>> genome.cbs.dtu.dk:9000/das/tmhmm/
>> genome.cbs.dtu.dk:9000/das/netoglyc/
> all these servers match to the DAS 1 spec which says that the second 
> to last bit
> is "das" and the last bit is the "data source name".
> The registry contains a check for that.

Ahh, right.  I misremembered and thought that "/das" had to
be immediately after the hostname.  Looking now there can be
an arbitrary prefix.

What I remembered was the servers at http://das.bcgsc.ca:8080/das
which don't have regular names.

Then again, they have nearly bit-rotted away.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 11:04:38 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 08:04:38 -0800
Subject: [DAS2] XML namespaces
In-Reply-To: <200603151549.41773.lstein@cshl.edu>
References: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>
	<200603151549.41773.lstein@cshl.edu>
Message-ID: <2de39a4a831f6a06c408bdf31ef2a41f@dalkescientific.com>

Linconl:
> BTW, if a namespace tag is reused in an inner scope with a
> different
>
> 	<das:name xmlns:das="http://foo.bar/das" />
> 		<das:first>Andrew</das:first>
> 		<das:middle 
> xmlns:das="http://addresses.com/address/2.0">K.</das:middle>
>                 <das:last>Dalke</das:last>
>           </das:name>
>
> I put middle into namespace http://addresses.com/address/2.0 and put 
> first and
> last into namespace http://foo.bar.das.
>
> This is the correct scoping behavior, right?

Yes.  I tested it with an XML process and it says the following is
equivalent (after fixing a typo).

<ns0:name xmlns:ns0="http://foo.bar/das">
   <ns0:first>Andrew</ns0:first>
   <ns1:middle 
xmlns:ns1="http://addresses.com/address/2.0">K.</ns1:middle>
   <ns0:last>Dalke</ns0:last>
</ns0:name>

BTW, it should be "P." :)

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 10:58:15 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:58:15 -0800
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
	<d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
	<58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk>
Message-ID: <a3848d3aa72cdf82472f36c3ba8093ab@dalkescientific.com>

Thomas:
> The registry records datasources, not server installations.  In 
> general, I'm not sure a server installation is a terribly 
> "interesting" object, since it's quite possible that one server 
> installation will host many datasources with little or no semantic 
> connection between them -- the only thing they have in common is that 
> they're hosted at the same site.

I agree.

The only thing that's interesting about the server installation is
knowing who is in charge when it goes down.  :)

That's found from the MAINTAINER element at the <SOURCES> level of
the sources document.


					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Wed Mar 15 11:37:51 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Wed, 15 Mar 2006 08:37:51 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day two, 14 Mar 2006
Message-ID: <C03D82DF.1D05A%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day two, 14 Mar 2006

$Id: das2-teleconf-2006-03-14.txt,v 1.1 2006/03/15 16:47:50 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E.
  Sanger: Andreas Prlic, Thomas Down
  Dalke Scientific: Andrew Dalke (at Affy)
  UC Berkeley: Nomi Harris (at Affy)
  UCLA: Allen Day (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


Agenda:
----------
See Andrew's email. Here's a summary.

* segment ids
* coord systems and how to handle

[Gregg is out, Andrew is leading the teleconf.]

ap: ad proposed changes re: coords and capabilities i think is not
really needed. the question is do annotation servers need to provide to
link to reference servers back. If the link is apparent from, c

ad: summary: moving coord element inside capabilities element (one
part of 4 things mentioned). the reason: coords and capabilities are
tied together. They refer to the same thing. E.g., you need know which
of the segments are tied to which coords.

ap: annotation server does need to, it can find the reference server
by the coordinates.

ad: if you have local coords, and you want to point to a local server,
how do you specify that this segment corresponds to these coords.
ap: you should have a reference server that speaks the coords you want
to annotate.
td: if you have your own assembly you have your own coord system,
ad: yes, and i set up my own ref server for it.

ad: if I have mult coords, won't I have multiple segments? isn't there
a 1:1 relationship between coords and segments?
ap: I think many:many.... wait
td: each segment is a member of one coord system, a coord system
contains many segments.
ad: andreas has features, some annotated on scaffold, some annotated
on chromosome. So, you need the ability to have two segments provided
by server.
ap: coords should contain segment capabilities, i.e., the other way
around.

ad: proposing to have a uri to id the coords, capapbility should have
a field to say the coord uri is 'this'
mailed out the idea to have a unique identifier for coords.
keep them separate now, have the ability
sc: optional?
ad: yes only needed if you have mult coord systems.

ad: like features and feature type. segment is saying it's of that
type

ad: will add optional id to the capability, so that you can figure out
what the segments are.

in proposal this am,
1) timestamp to coord info (optional) -- use case: sort by most recent
coord system for a given build.
2) unique id for the coord (

ap: this will be useful for searches as well. can request only results
from a particular coord system. (see email discussion this am)
td: server alignment btwn human and mouse, you can say whether you are
referencing human or mouse just by specifying coord system.
ad: also two different human assemblies.

ap: I have to leave now.

Topic: Segment identifiers email

td: segment had a name and url form id so that feature server doesn't
have to give a concrete url for the seq of chrm22, nice for
lightweight server sans sequence. getting rid of ability to reference
sequence by name instead of url breaks this. You need a concrete url
if you just want to serve features on a sequence.
You end up having to rewrite urls rather than saying this feature is
attached to chr22 in xxx coord system.

ad: one thing gregg and I discussed, the fact that url is by itself an
opaque id, you have to resolve it someway, http, or something else
too. You can use any mechanism you want to turn the name you want.
ad: in segments list, if you have your own local copy. Your segments
section says my local copy is
td: you need a segments capability. I can't have a server that uses
only features capabilities.
ad: if you have your own segments.
if all your features are described using standard names/ids, no you
don't need a segments capability.
td: ok, my assembly is human build 35, and feature lives on chr22.
ad: yes. every place you see optional alias attribute link back to
primary id of segment, that id can be anything.
td: arbitrary string scoped by the coord system, which now has a uri
id string.
ad: yes. and it's also globally unique, not scoped just by coord
system .

td: I don't see what's wrong with ....
ad: we were discussing yesterday having diff names for the same
chromosome. chrI vs chr1.
td: that can be addressed using aliases
ad: alias of field provides a synonym table for what you map locally
to a global id. 
td: you're saying the global ids have to be universally unique even
when taken out of the coord system
ad: yes. feat server providing feats from two diff coord systems, you
need a way to distinguish one segment from another segment, in a
global sense.
td: I don't totally understand cases involving mult coord systems. How
do I find out which of three possible coord systems a given segment
came from?
ad:
td: all clones in embl system. could be a lot.
ad: your client will have to know how to look up the right one.
if you have one coord system that has all your clones, you have to do
the look up anyway to know where to display the features from the
various clones.
td: suppose looking for gene names: you get back a feature on clone
AL19823. I want to start from that feature and build a meaningful
display. So  I need to work out what coord system this feature lives
on. If my server speaks multiple coord systems, one for all embl
accessions and gi ids, I have to test for membership in the set.
My server could put the coord system id on each feature. Would be
optional for servers only attached to one coord system.

ad: right. Andreas also wants coord uri part of feature filter. Could
add it to the feature filter.
td: yes. give me all genes called xyz. Do you always want to limit to
one coord system?
ad: I see your point. Having to search

ad: New thing called title for humans to read.
Also proposed inside, overlaps, contains so they don't

td: to avoid a nastiness in query lang, I like that. Removes an issue
that scares me about having urls in the query. pathological case:
client has a good reason to retrieve features on part of a two
sequences that have lots of features on. e.g., all cutting sites for
all restriction enzymes. Very high density. If the genome is made of
10kb clones, the user may want to get features that span clone
boundaries. server may do lots of extra fetching that's not really
necessary. 
ad: it's the number of requests that's the issue, same amout of
info. so it's an issue of network overhead.

advantage: makes servers easier to implement since it eliminates
searching partial regions. Some use cases exists, but can be done on
the client side. 
td: seems a shame to lose the capability, but not a huge loss.
the alternative would be to say that you parse the query string left
to right. overlaps=5000-10000; ... puts limits on how server parses.
ad: or we propose a new query interface

ad: this sounds like I should go ahead with segment ids.

ad: using uri vs id (internal link id vs link to something else)
td: seems to be enough impl-breaking changes, not a big argument
either way.
ad: enough changes going on now, but probably won't change much more.
td: if you want to make a small change that's quick to implement, no
objections. Also fine with using id, since all dom stuff about id
refers to things marked id in the scheme, not attrib names. Changing
to uri, won't cause much effect.
nh: like a gobal replace.
ad: in general there's been lots of changes, want people to get
clients/servers going.
ad: spec writing is going slow, would like to show examples that
people can use.
nh: feature parsing can use canned examples.
aday: would prefer to have spec written, trouble with ambiguity
ad: you need to impl before you can figure out how to write it.
nh: server people need full spec, client can use examples

ad: previous slow going since lincoln had little time to work on it.
aday: would like a snapshot, version number. impl after last code
sprint.
nh: don't have time to work on das after this. will just break when/if
allen's server changes.
This just happens when working on developing spec.

ad: the idea is to get code and examples up today.
td: waiting for spec to stabilize a bit.
ad: changes made this week won't have major impact on people's work in
UK?
td: no.

nh: can you provide a changes document?
ad: those would be my emails. a pain.

nh: registry, I was suprised to find a versioned sources in it. won't
there be an explosion of org x versions x server. It provides
convenience
td: as long as it's not thousands and thousands of data sources, it
won't be a problem.
ad: 2k per server x 1000 servers, = 2M
td: if it gets to point where retrieving whole registry is a problem,
we could add capability to restrict what you get.
nh: need human-friendly title for each data source.
would be nice if that explained more to the person who was choosing
that data source (e.g., date).
ad: Andreas' system (web-based) has a description.


Status reports
--------------

sc: adding more data to affy das server, working on building
das2_server code recently checked into genoviz code base by
gregg. Then will work on setting it up on a publically accessible
server at affy.

ee: will be working on style sheets in igb.

aday: spent time on setting up dev environment since laptop died
yesterday. 

bo: got food poisoning -- bad pizza?, was up till 4am.

td: not much das-related stuff yet.


From Steve_Chervitz at affymetrix.com  Wed Mar 15 16:24:59 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Wed, 15 Mar 2006 13:24:59 -0800
Subject: [DAS2] New affymetrix das/2 development server
Message-ID: <C03DC62B.1D090%Steve_Chervitz@affymetrix.com>


Gregg's latest spec-compliant, but still development-grade, das/2 server is
now publically available via http://205.217.46.81:9091

It's currently serving annotations from the following assemblies:
- human hg16 
- human hg17 
- drosophila dm2

Send me requests for any other data sources that would help your development
efforts.

Example query to get back a das-source xml document:
http://205.217.46.81:9091/das2/genome/sequence

It's compliance with the spec is steadily improving, on a daily if not
hourly basis during the code sprint.

Within IGB you can access this server from the DAS/2 servers tab
under 'Affy-temp'.

You'll need the latest version of IGB from the CVS repository at
http://sf.net/projects/genoviz

Steve


From dalke at dalkescientific.com  Wed Mar 15 16:25:53 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 13:25:53 -0800
Subject: [DAS2] on local and global ids
Message-ID: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>

The discussion today was on local segment identifiers vs. global
segment identifiers.

I'm going to characterize them as "abstract" vs. "concrete"
identifiers.  An abstract id has no default resolution to a
resource.  A concrete one does.

The identifier "http://www.biodas.org/" is concrete identifier
because it has a default resolver. "lsid:ncbi:human:35" is an
abstract identifier because it has no default resolver (though
there are resolvers for lsid they are not default resolvers.)

The global segment identifier may be a concrete identifier.  It
may implement the segments interface.  But who is in charge of
that?  Who defines and maintains the service?  If it goes down,
(power outage, network cable cut) then what does the rest of
the world do?

For the purposes of DAS it is better (IMO) that the global
identifiers be abstract, though they should be http URLs which
are resolvable to something human readable.  (This is what
the XML namespace elements do.)

Reference servers are concrete identifiers.  They exist.  They
can change (eg, change technologies and change the URLs, say
from cgi-bin/*.pl to an in-process servlet.)  Now, they should
be long-lived, but that's not how life works.

Suppose someone wants to set up an annotation server, without
setting up a reference server.  One solution is to point to
an existing reference server.

<SOURCES>
  <SOURCE>
   <VERSION>
    <CAPABILITY type="segments" 
uri="http://some/remote/reference/server" />
    <CAPABILITY type="features" uri="features.cgi" />
    <CAPABILITY type="types" uri="types.xml" />
   </VERSION>
  </SOURCE>
</SOURCES>

In this case all the features are returned with segments labeled
as in the reference server.  There's no problem.

Second, Andreas wants an abstract "COORDINATE" space id

<SOURCES>
  <SOURCE>
   <VERSION>
    <COORDINATES uri="http://some/arbitrary/coordinate/id" 
authority="NCBI"
       version="35" .... />
    <CAPABILITY type="features" uri="features.cgi" />
    <CAPABILITY type="types" uri="types.xml" />
   </VERSION>
  </SOURCE>
</SOURCES>

This requires a more complicated client because it must have other
information to figure out how to convert from the coordinate identifier
into the corresponding types.

The answer that Andreas and others give is "consult the registry".
That is, look for other other segments CAPABILITY elements with
the same coordinates id.  For that to happen there needs to be a
way to associate a segments doc with a coordinate system.  For example,
this is what the current spec allows (almost - there's no example
of it and I'm still trying to get the schema working for it)

<SOURCES>
  <SOURCE>
   <VERSION>
    <COORDINATES uri="http://some/arbitrary/coordinate/id" 
authority="NCBI"
       version="35" .... />
    <CAPABILITY type="segments" uri="features.cgi"
             coordinates="http://some/arbitrary/coordinate/id" />
   </VERSION>
  </SOURCE>
</SOURCES>


This makes a resolution scheme from an abstract coordinate identifier
into a concrete segments document identifier.

Why are there so many fields on the coordinates?  It could be 
normalized,
so you fetch the coordinate id to get the information.  It's there
to support searches.  A goal has been that the top-level sources 
document
gives you everything you need to know about the system.

(Doesn't mean it's elegant.  I won't talk about alternatives.  It's
not important.  There's at most an extra 150 or so bytes per versioned
source.)


The problem comes when a site wants a local reference server.
These segments have concrete local names.

DAS1 experience suggests that people almost always set up local
servers.  They do not refer to an well-known server.

There are good reasons for doing this.  If the local annotation
server works then the local reference server is almost certain
to work.  The well-known server might not work.

Also, the configuration data is in the sources document.  There's
no need to set up a registry server to resolve coordinates.  There's
no configuration needed in the client to point to the appropriate
concrete identifier given an abstract URL.

My own experience has been that people do not read specifications.
I am an odd-ball.  According to

    http://diveintomark.org/archives/2004/08/16/specs

I am an asshole.  That's okay -- most people are morons.

> Morons, on the other hand, don?t read specs until someone yells at 
> them. Instead, they take a few examples that they find ?in the wild? 
> and write code that seems to work based on their limited sample. Soon 
> after they ship, they inevitably get yelled at because their product 
> is nowhere near conforming to the part of the spec that someone else 
> happens to be using. Someone points them to the sentence in the spec 
> that clearly spells out how horribly broken their software is, and 
> they fix it.

Someone who wants to implement a DAS reference server will
take the data from somewhere and make up a local naming scheme.

That's what happened with DAS1.  That's why Gregg was saying
he maintains a synonym table saying human
    1 = chr1 = Chromo1 = ChrI
    2 = chr2 = Chromo2 = ChrII

This will not change.  People will write a server for local data
and point a DAS client at it.  The client had better just work
for the simple case of viewing the data even through there is
no coordinate system -- it needs to, because people will work on
systems with no coordinate system.

Sites will even write multiple in-house DAS servers providing
data, which work because everything refers to the same in-house
reference server.

It's only the first time that someone wants to merge in-house
data with external data that there's a problem.  This might be
several months after setting up the server.  At that point they
do NOT want to rewrite all the in-house servers to switch to
a new naming scheme.

That's why the primary key for a paired annotation server and
feature must be a local name.  That's what morons will use.
Few will consult some global registry to make things interoperable
at the start.

> For example, some people posit the existence of what I will call the 
> ?angel? developer. ?Angels? read specs closely, write code, and then 
> thoroughly test it against the accompanying test suite before shipping 
> their product. Angels do not actually exist, but they are a useful 
> fiction to make spec writers to feel better about themselves.

Lincoln could come up with universal names for every coordinate
system that ever existed or will exist.  But people will not
consult it.

However, they will when there is a need to do that.  The need comes
in when they want to import external data.  At that point they need
a way to join between two different data sources.

They consult the spec and see that there's a "synonym" (or "reference",
or "global", or "master" or *whatever* name -- I went with synonym
because it doesn't imply that it's the better name.)


  <SEGMENT uri="segment/chrI" title="Chromosome I" length="230209"
      synonym="http://dalkescientific.com/yeast1/ChrI" />

The local name <xml:base> + "segment/ChrI" is also known as
http://dalkescientific.com/yeast1/ChrI .  Simple, and requires
very little change in the server code.

The only other change is to support the synonym name when
doing segment requests, as
   segment=http://dalkescientific.com/yeast1/ChrI

This is important because then clients can make range requests
from servers without having to download the segment document first.
It's also easy to implement, because it's a lookup table in the
web server interface, and not something which needs to be in
the database proper.

Most people are morons.  The spec as-is is written for that.
It's not written for angels.  It allows post-facto patch-ups once
people realize they need a globally recognized name.

It does require smarter clients.  They need to map from local
name to global name, through a translation table provided by
the server.  This is fast and easy to implement.  It's easier
to implement than consulting multiple registry servers and
trying to figure out which is appropriate.

And the XML returned will be smaller.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 17:39:36 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 14:39:36 -0800
Subject: [DAS2] xml namespace uri
Message-ID: <f5afcfa23b8826ee042fa81e5c4bc57a@dalkescientific.com>

Please use

   "http://biodas.org/documents/das2"

for the XML element namespace.

The two current servers (Allen's and Steve's) use

   "http://www.biodas.org/ns/das/2.00"

which is wrong according to the spec, for the last 2 years it's been

   "http://www.biodas.org/ns/das/genome/2.00"

Since the servers need to change anyway, might as well make it
something a bit more readable, and shorter.  :)

I've checked all the current dasypus (validator) software into
CVS, btw, and updated all of the example xml (draft3/ucla/) to
use the new namespace.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 00:17:24 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 21:17:24 -0800
Subject: [DAS2] query language description
Message-ID: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>

The query fields are

   name      |  takes | matches features ...
  ==========================
   xid       |  URI   | which have the given xid
   type      |  URI   | with the given type or subtype (XX keep this  
one???)
   exacttype |  URI   | with exactly the given type
   segment   |  URI   | on the given segment
   overlaps  | region | which overlap the given region
   inside    | region | which are contained inside the given region (XX  
needed??)
   contains  | region | which contain the given region  (XX needed?? )
   name      | string | with a name or alias which matches the given  
string
   prop-*    | string | with the property "*" matching the given string

Queries are form-urlencoded requests.  For example, if the features
query URL is 'http://biodas.org/features' and there is a segment named
'http://ncbi.org/human/Chr1' then the following is a request for all the
features on the first 10,000 bases of that segment

The query is for
     segment = 'http://ncbi.org/human/Chr1'
     overlaps = 0:10000

which is form-urlencoded as

    
http://biodas.org/features? 
segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A1000

Multiple search terms with the same key are OR'ed together.  The  
following
searches for features containing the name or alias of either
BC048328 or BC015400

   http://biodas.org/features?name=BC048328;name=BC015400

Multiple search terms with different keys are AND'ed together,
but only after doing the OR search for each set of search terms with
identical keys.  The following searches for features which have
a name or alias of BC048328 or BC015400 and which are on the segment
http://ncbi.org/human/Chr1

    
http://biodas.org/features?name=BC048328; 
segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400

The order of the search terms in the query string does not affect
the results.

If any part of a complex feature (that is, one with parents
or parts) matches a search term then all of the parents and
parts are returned.  (XXX Gregg -- is this correct? XXX)


The fields which take URLs require exact matches.

I think we decided that there is no type inferencing done in
the server; it's a client side thing.  In that case the 'type'
field goes away.  We can still keep 'exacttype'.  The URI
used for the matching is the type uri, and NOT the ontology URI.

(We don't have an ontology URI yet, and when we do we can add
an 'ontology' query.)

The segment URI must accept the local identifier.  For
interoperability with other servers they must also accept the
equivalent global identifier, if there is one.

If range searches are given then one and only one segment is
allowed.  Multiple segments may be given, but then ranges are not
allowed.

The string searches support a simple search language.
     ABC  -- contains a word which exactly matches "ABC" (identity, not  
substring)
    *ABC  -- words ending in "ABC"
     ABC* -- words starting with "ABC"
    *ABC* -- words containing the substring "ABC"

If you want a field which exactly contains a '*' you're kinda
out of luck.  The interpretation of whitespace in the query or
in the search string is implementation dependent.  For that
matter, the meaning of "word" is implementation dependent.  (Is
*O'Malley* one word? *Lethbridge-Stewart*?)

When we looked into this last month at Sanger we verified that
all the databases could handle %substring% searches, which was
all that people there wanted.  The Affy people want searches for
exact word, prefix and suffix matches, as supported by the the
back-end databases.


   XXX CORRECT ME XXX

The 'name' search searches.... It used to search the 'name'
attribute and the 'alias' fields.  There is no 'name' now.  I
moved it to 'title'.  I think I did the wrong thing; it should
be 'name', but it's a name meant for people, not computers.

Some features (sub-parts) don't have human-readable names so
this field must be optional.


The "prop-*" is a search of the <PROP> elements.  Features may
have properties, like

    <PROP key="cellular_component" value="membrane" />

To do a string search for all 'membrane' cellular components,
construct the query key by taking  the string "prop-" and
appending the property key text ("cellular_component").  The
query value is the text to search for.

     prop-cellular_component=membrane

To search for any cellular_component containing the substring "mem"

     prop-cellular_component=*membrane*

The rules for multiple searches with the same key also apply to the
prop-* searches.  To search for all 'membrane' or 'nuclear'
cellular components, use two 'prop-cellular_component' terms, as

      
http://biodas.org/features?prop-cellular_component=membrane;prop- 
cellular_component=membrane


The range searches are defined with explicit start and end
coordinates.  The range syntax is in the form "start:end", for
example, "1:9".

Let 'min' be the smallest coordinate for a feature on a given
segment and 'max' be one larger than the largest coordinate.
These are the lower and upper founds for the feature.

An 'overlaps' search matches if and only if
    min < end AND max > start

XXX For GREG XXX

What do 'inside' and 'contains' do?  Can't we just get
away with 'excludes', which has complement of 'overlaps'?
Searches are done as:
   Step 0) specify the segment
   Step 1) do all the includes  (if none, match all features on segment)
   Step 2) do all the excludes, inverted (like an includes search)
   Step 3) only return features which are in Step 1 but not
       in Step 2)
   Step 4) ...
   Step 5) Profit!

I think this will support your smart code, and it's easy
enough to implement.

Every one but you was planning to use 'overlaps'.  Only you
wanted to use 'inside'.  Anyone want to use 'contains'?

					Andrew
					dalke at dalkescientific.com


From td2 at sanger.ac.uk  Thu Mar 16 04:24:03 2006
From: td2 at sanger.ac.uk (Thomas Down)
Date: Thu, 16 Mar 2006 09:24:03 +0000
Subject: [DAS2] on local and global ids
In-Reply-To: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
Message-ID: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>


On 15 Mar 2006, at 21:25, Andrew Dalke wrote:
>
> The problem comes when a site wants a local reference server.
> These segments have concrete local names.
>
> DAS1 experience suggests that people almost always set up local
> servers.  They do not refer to an well-known server.

I'm not sure that DAS1 experience is a good model for this.  It's  
true that people didn't always point to well-known reference servers,  
but I think this has more to do with the fact that people didn't know  
which server to point to.  Some people did set up their own reference  
servers.  Many didn't, and many of those didn't give a valid  
MAPMASTER URL at all.  This situation didn't actually cause too much  
trouble since a lot of these users just wanted to add a track to  
Ensembl -- which doesn't care about MAPMASTER URLs and just trusts  
the user to add tracks that live in an appropriate coordinate system.

I'd still argue that the majority -- probably the vast majority -- of  
people setting up DAS servers really just want to make an assertion  
like "I'm annotating build NCBI35 of the human genome" and be done  
with it.  That's what the coordinate system stuff in DAS/2 is for.   
If this is documented properly I don't think we'll see many "end- 
user" sites setting up their own reference servers unless a) they  
want an internal mirror of a well-known server purely for performance/ 
bandwidth reasons or b) they want to annotate an unpublished/new/ 
whatever genome assembly.

(Actually, some of the "annotation providers set up their own  
reference servers" stuff might be my fault -- early versions of  
Dazzle were pretty strict about requiring a valid [and functional!]  
MAPMASTER for every datasource, so this pushed people towards setting  
up reference servers.)

                 Thomas.


From lstein at cshl.edu  Thu Mar 16 06:03:49 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu, 16 Mar 2006 11:03:49 +0000
Subject: [DAS2] on local and global ids
In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
Message-ID: <200603161103.50323.lstein@cshl.edu>

I think it will help considerably to have a document that lists the valid 
sequence IDs for popular annotation targets. I've spoken with Ewan on this, 
and Ensembl will generate a list of IDs for all vertebrate builds. I'll take 
responsibility for creating IDs for budding yeast, two nematodes and 12 
flies.

Lincoln

On Thursday 16 March 2006 09:24, Thomas Down wrote:
> On 15 Mar 2006, at 21:25, Andrew Dalke wrote:
> > The problem comes when a site wants a local reference server.
> > These segments have concrete local names.
> >
> > DAS1 experience suggests that people almost always set up local
> > servers.  They do not refer to an well-known server.
>
> I'm not sure that DAS1 experience is a good model for this.  It's
> true that people didn't always point to well-known reference servers,
> but I think this has more to do with the fact that people didn't know
> which server to point to.  Some people did set up their own reference
> servers.  Many didn't, and many of those didn't give a valid
> MAPMASTER URL at all.  This situation didn't actually cause too much
> trouble since a lot of these users just wanted to add a track to
> Ensembl -- which doesn't care about MAPMASTER URLs and just trusts
> the user to add tracks that live in an appropriate coordinate system.
>
> I'd still argue that the majority -- probably the vast majority -- of
> people setting up DAS servers really just want to make an assertion
> like "I'm annotating build NCBI35 of the human genome" and be done
> with it.  That's what the coordinate system stuff in DAS/2 is for.
> If this is documented properly I don't think we'll see many "end-
> user" sites setting up their own reference servers unless a) they
> want an internal mirror of a well-known server purely for performance/
> bandwidth reasons or b) they want to annotate an unpublished/new/
> whatever genome assembly.
>
> (Actually, some of the "annotation providers set up their own
> reference servers" stuff might be my fault -- early versions of
> Dazzle were pretty strict about requiring a valid [and functional!]
> MAPMASTER for every datasource, so this pushed people towards setting
> up reference servers.)
>
>                  Thomas.
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From lstein at cshl.edu  Thu Mar 16 06:06:38 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu, 16 Mar 2006 11:06:38 +0000
Subject: [DAS2] Spec freeze
In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
Message-ID: <200603161106.39074.lstein@cshl.edu>

Hi,

I just spoke with Thomas and Andreas on this, and all three of us are experiencing difficulty coding to a changing spec. In my opinion the spec is really good right now and issues such as whether to use "uri" or "id" as attribute names are not germaine. Can I propose that we declare a three month spec freeze starting at midnight tonight (GMT)?

Lincoln


-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From dalke at dalkescientific.com  Thu Mar 16 10:38:00 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 07:38:00 -0800
Subject: [DAS2] on local and global ids
In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
Message-ID: <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>

Thomas:
> I'm not sure that DAS1 experience is a good model for this.  It's true 
> that people didn't always point to well-known reference servers, but I 
> think this has more to do with the fact that people didn't know which 
> server to point to.

I think I said there are two cases; there's actually several

  1. the sources document states a well-known COORDINATES
       and makes no links to segments
  2. the sources document refers to a well-known segments server
       ("the" reference server) and no COORDINATES
  3. the source document has a segments document, and each segment
      listed uses URIs from "the" reference server
  4. the server implements its own coordinates server, with
      new segment ids
  5. When uploading a track to Ensembl there's no need to have
      either COORDINATE or segments -- the upload server can
      verify for itself that the upload uses the right ids.


The *only* concern is with #4.  Everything else uses the well-known
global identifier for segments.

> I'd still argue that the majority -- probably the vast majority -- of 
> people setting up DAS servers really just want to make an assertion 
> like "I'm annotating build NCBI35 of the human genome" and be done 
> with it.

I'm fine with that.  There are two ways to do it.  #1 and #2 above.
In theory only one of those is needed.   The document can point to
"the" reference server for NCBI 35.

In practice that's not sufficient because there is no authoritative
NCBI 35 server.

Hence COORDINATES provides an abstract global identifier describing
the reference server.

>   That's what the coordinate system stuff in DAS/2 is for.  If this is 
> documented properly I don't think we'll see many "end-user" sites 
> setting up their own reference servers unless a) they want an internal 
> mirror of a well-known server purely for performance/bandwidth reasons 
> or b) they want to annotate an unpublished/new/whatever genome 
> assembly.

A philosophical comment.  I'm a distributed, self-organizing kinda
guy.  I don't think single root centralized systems work well when
there are many different groups involved.

I think many people will use the registry server, but not all.
I think there will be public DAS servers which aren't in the registry.
I know there will be in-house DAS servers which aren't.

I'm just about certain that some sites will have local copies of
the primary data.  They do for GenBank, for PDB, for SWISS-PROT,
for EnsEMBL.  Why not for DAS?

That said, here's a couple of questions for you to answer:

   a) When connecting to a new versioned source containing only
COORDINATES data, what should the client do to get the list
of segments, sizes, and primary sequence?

I can think of several answers.  My answer is that the versioned
source should state the preferred reference server and unless
otherwise configured a client should use that reference server
and only that reference server.

Yes, all the reference servers for that coordinate system
are supposed to return the same results.  But that's only if
they are available.  There are performance issues too, like
low bandwidth or hosting the server on a slow machine.  The
DAS client shouldn't round-robin through the list until it
finds one which works because that could take several minutes
to timeout on a single server, with another 10 to try.

Yes, a client can be configured and told "for coordinate
system A use reference server Z".  But that's a user
configuration.

   b) If there is a local mirror of some reference server, how
should the local DAS clients be made aware of it?  (And
should this be a supportable configuration? I think so.)

I'm pretty sure that most DAS clients won't be configurable
to look for local servers instead of global ones.  Even if
they are, I'm pretty sure each will have a different way
to do so.  Apollo and Bioperl will use different mechanisms.

I have no good answer for this.  It sounds like your answer
is "people won't have local copies."  I think they will.

Ideas:
   - have a rewriting registry server which does a rewrite of
the information from the other servers.  But this doesn't
work because the feature result from the remote server (in
my scheme) is given using its local segment names.  There's
no way to go from that local name to the appropriate mirror
reference server.  This suggests that the results really do
need to be given through global ids, with no support for
local ones.  The segments result optionally provides a way
to resolve a global name through a local resource.

   - set up an HTTP proxy service for DAS requests which
transparently detects, translates and redirects to the
appropriate local resource.  Cute, but not likely to be
done in real life.

   c) A group has been working on a new genome/assembly.  The
data is annotated on local machines using DAS and DAS writeback
Finally it's published.  Do they need to rewrite all their
segment identifiers to use the newly defined global ones?

As there are only a few places where the segment identifier is
used, and it's an interface layer, I think the conversion is
easy.  But it is a flag day event which means people don't
want to do it.  Instead, it's more likely that local people
will set up a synonym table to help with the conversion.

There are perhaps a dozen groups which might do this and they
all have competent people.  This should not be a problem.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 11:06:26 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 08:06:26 -0800
Subject: [DAS2] on local and global ids
In-Reply-To: <200603161103.50323.lstein@cshl.edu>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
	<200603161103.50323.lstein@cshl.edu>
Message-ID: <f35df2e5087a0027594a6dd2cd1d9a28@dalkescientific.com>

Lincoln:
> I think it will help considerably to have a document that lists the 
> valid
> sequence IDs for popular annotation targets. I've spoken with Ewan on 
> this,
> and Ensembl will generate a list of IDs for all vertebrate builds. 
> I'll take
> responsibility for creating IDs for budding yeast, two nematodes and 12
> flies.

What should people use if there aren't defined?  Like now?

If everyone must use the same well-defined global id for the features
response then doesn't that mean we can't have any DAS servers
until this document is made?

Is the general requirement that the first person to make a server for
a given build/genome/etc. is the one who gets to define the
global ids?  Or is it Andreas at Sanger who defines the names?

Suppose one group in California starts defining names for, say,
the barley genome.  Another group in say, Germany, is also working
on the barley genome.  They hate each others guts and don't work
together, so they make their own names.  The names refer to the
same thing because it was a group in Japan which produced the
genome.  Do we wait for an alignment service?  An identity service?
before people can merge data from these two groups?

Maybe we can solve all this by having an identity mapper format.
And defer defining that format until there is a problem.

There is no perfect solution.  This is a sociological problem.

Gregg's current client, I think, used hard-coded knowledge about the
mapping between the two current servers.  Then again, his code
already supports a synonym table.


					Andrew
					dalke at dalkescientific.com


From gilmanb at pantherinformatics.com  Thu Mar 16 10:52:51 2006
From: gilmanb at pantherinformatics.com (Brian Gilman)
Date: Thu, 16 Mar 2006 10:52:51 -0500
Subject: [DAS2] on local and global ids
In-Reply-To: <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
	<41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>
Message-ID: <441989D3.90202@pantherinformatics.com>

Hey Guys,

    Where's the latest spec and use case document? Sorry if this is a 
super dumb question. I couldn't find it on the website.

                               Best,

                                        -B

Andrew Dalke wrote:

>Thomas:
>  
>
>>I'm not sure that DAS1 experience is a good model for this.  It's true 
>>that people didn't always point to well-known reference servers, but I 
>>think this has more to do with the fact that people didn't know which 
>>server to point to.
>>    
>>
>
>I think I said there are two cases; there's actually several
>
>  1. the sources document states a well-known COORDINATES
>       and makes no links to segments
>  2. the sources document refers to a well-known segments server
>       ("the" reference server) and no COORDINATES
>  3. the source document has a segments document, and each segment
>      listed uses URIs from "the" reference server
>  4. the server implements its own coordinates server, with
>      new segment ids
>  5. When uploading a track to Ensembl there's no need to have
>      either COORDINATE or segments -- the upload server can
>      verify for itself that the upload uses the right ids.
>
>
>The *only* concern is with #4.  Everything else uses the well-known
>global identifier for segments.
>
>  
>
>>I'd still argue that the majority -- probably the vast majority -- of 
>>people setting up DAS servers really just want to make an assertion 
>>like "I'm annotating build NCBI35 of the human genome" and be done 
>>with it.
>>    
>>
>
>I'm fine with that.  There are two ways to do it.  #1 and #2 above.
>In theory only one of those is needed.   The document can point to
>"the" reference server for NCBI 35.
>
>In practice that's not sufficient because there is no authoritative
>NCBI 35 server.
>
>Hence COORDINATES provides an abstract global identifier describing
>the reference server.
>
>  
>
>>  That's what the coordinate system stuff in DAS/2 is for.  If this is 
>>documented properly I don't think we'll see many "end-user" sites 
>>setting up their own reference servers unless a) they want an internal 
>>mirror of a well-known server purely for performance/bandwidth reasons 
>>or b) they want to annotate an unpublished/new/whatever genome 
>>assembly.
>>    
>>
>
>A philosophical comment.  I'm a distributed, self-organizing kinda
>guy.  I don't think single root centralized systems work well when
>there are many different groups involved.
>
>I think many people will use the registry server, but not all.
>I think there will be public DAS servers which aren't in the registry.
>I know there will be in-house DAS servers which aren't.
>
>I'm just about certain that some sites will have local copies of
>the primary data.  They do for GenBank, for PDB, for SWISS-PROT,
>for EnsEMBL.  Why not for DAS?
>
>That said, here's a couple of questions for you to answer:
>
>   a) When connecting to a new versioned source containing only
>COORDINATES data, what should the client do to get the list
>of segments, sizes, and primary sequence?
>
>I can think of several answers.  My answer is that the versioned
>source should state the preferred reference server and unless
>otherwise configured a client should use that reference server
>and only that reference server.
>
>Yes, all the reference servers for that coordinate system
>are supposed to return the same results.  But that's only if
>they are available.  There are performance issues too, like
>low bandwidth or hosting the server on a slow machine.  The
>DAS client shouldn't round-robin through the list until it
>finds one which works because that could take several minutes
>to timeout on a single server, with another 10 to try.
>
>Yes, a client can be configured and told "for coordinate
>system A use reference server Z".  But that's a user
>configuration.
>
>   b) If there is a local mirror of some reference server, how
>should the local DAS clients be made aware of it?  (And
>should this be a supportable configuration? I think so.)
>
>I'm pretty sure that most DAS clients won't be configurable
>to look for local servers instead of global ones.  Even if
>they are, I'm pretty sure each will have a different way
>to do so.  Apollo and Bioperl will use different mechanisms.
>
>I have no good answer for this.  It sounds like your answer
>is "people won't have local copies."  I think they will.
>
>Ideas:
>   - have a rewriting registry server which does a rewrite of
>the information from the other servers.  But this doesn't
>work because the feature result from the remote server (in
>my scheme) is given using its local segment names.  There's
>no way to go from that local name to the appropriate mirror
>reference server.  This suggests that the results really do
>need to be given through global ids, with no support for
>local ones.  The segments result optionally provides a way
>to resolve a global name through a local resource.
>
>   - set up an HTTP proxy service for DAS requests which
>transparently detects, translates and redirects to the
>appropriate local resource.  Cute, but not likely to be
>done in real life.
>
>   c) A group has been working on a new genome/assembly.  The
>data is annotated on local machines using DAS and DAS writeback
>Finally it's published.  Do they need to rewrite all their
>segment identifiers to use the newly defined global ones?
>
>As there are only a few places where the segment identifier is
>used, and it's an interface layer, I think the conversion is
>easy.  But it is a flag day event which means people don't
>want to do it.  Instead, it's more likely that local people
>will set up a synonym table to help with the conversion.
>
>There are perhaps a dozen groups which might do this and they
>all have competent people.  This should not be a problem.
>
>					Andrew
>					dalke at dalkescientific.com
>
>_______________________________________________
>DAS2 mailing list
>DAS2 at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/das2
>
>
>  
>


From dalke at dalkescientific.com  Thu Mar 16 11:33:58 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 08:33:58 -0800
Subject: [DAS2] on local and global ids
In-Reply-To: <441989D3.90202@pantherinformatics.com>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
	<41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>
	<441989D3.90202@pantherinformatics.com>
Message-ID: <24b985c0229970562a9e2612f00f2da5@dalkescientific.com>

Brian:
>    Where's the latest spec and use case document? Sorry if this is a 
> super dumb question. I couldn't find it on the website.

CVS for the spec.  The history is:

draft 1 - written by Lincoln, freeze for summer last year.
This is the one with HTML, etc. and is on the web site.

draft 2 - written by me in January.  In CVS under das/das2/new_spec.txt
with examples under das/das2/scratch . This was the version
for the spring last month

draft 3 - under development
I rewrote beginning of it because no one liked the pedantic
pedagogical style it used.  This draft starts with examples.
The incomplete version, as of Monday morning, is 
das/das2/draft3/spec.txt
However, I am slow at writing spec text, especially new text.
Instead of working on it more I put example output files in
  das/das2/draft3/ucla/
starting with 'sources.xml' in that directory.

As for use cases, the email you saw from me a couple of days
ago is the only thing even close to formal.

					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Thu Mar 16 12:05:10 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu, 16 Mar 2006 17:05:10 +0000
Subject: [DAS2] sources responses
Message-ID: <355af8b441fefe8690a9e78de55fc2f9@sanger.ac.uk>

Hi!

the (toy) sources responses at

http://www.spice-3d.org/dasregistry/das1/sources/
http://www.spice-3d.org/dasregistry/das2/sources/

now are updated to the latest spec and validate with Andrew's validator 
at
http://cgi.biodas.org:8080/

Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From Steve_Chervitz at affymetrix.com  Thu Mar 16 15:37:16 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Thu, 16 Mar 2006 12:37:16 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day three, 15 Mar 2006
Message-ID: <C03F0C7C.1D0F7%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day three, 15 Mar 2006

$Id: das2-teleconf-2006-03-15.txt,v 1.1 2006/03/16 20:45:35 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Thomas Down, Andreas Prlic
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


[Notetaker: joining 10 min into the discussion]

ls: how does synonym business work?
ad: if server has access to data...
ls: we ask server for the global id, uses same global id for segments,
and uses same global id for the sequence.
gh: to do this in the capabilities for annot server, the global id for
segments query points to reference server.
ls: if the local machine current server, has sequence capabilities,
then it passes global id for segments to current server and it gets
the sequence. if it doesn't have that capability, then we need to
figure out a way for it to get the sequence. the easiest way to do
that would be to resolve that url and fetch it. I'm open to any
suggestion. I don't see how this uri/synonym is getting us any closer
to being able to find the server where sequence can be fetched. The
synonym isn't always a fetchable thing.
ad: syn is a global id
ad: look at the uri for the segment and fetch it from there
ls: could be a remote url.
gh: segments query is only thing that gives segment url
segments capabilities for the annot server should point

ls: break apart segments into: id=a string, then have an attribute
seq_url, when fetched returns the seq. returns the bases.
ad: is that's what's there already?
ls: no, uri is an id
ad: every url is an id, but it's up to whim of the server
ls: i don't want people to think its for an id.
want an agreed upon uri identifier, then optionally have a url.
turn synonym into uri, turn uri into resolver
make uri required, bases not required.
ad: additional constraint is 'agreed upon'. what about a group
starts a new sequencing project. There is no globally known uri for it
yet.
ls: they just create their own ids
td: the natural authority is the creator of the assembly.
gh: ncbi won't do it. they don't have a das server, unlikely to.
ls: can point to genome assembly. can create a url that will return
bases from ncbi in a supported format.
this approach will disentangle issue of resolvable vs non-resolvable,
local vs non-local segment ids and how to get segment dna.
gh: I think this will work.

ad: 'this' changing key names?
ls: key semantics
uri is required, global identifier sequence is an optional pointer
gh: you say that for feat xml, the id for seq will be the globally
agreed on id.
ls: yes
ad: if you don't have a local copy, if you have ability to map global
identifiers, then you know what it is from the coordinates.
there are two ways to specificy coordinates: coordinates and segments

ad: if you just need the segments and some identifier.
only when you need to do an overlay with someone else that you
need the coords.
gh: no, coords don't say anything about ids of coord (?)

gh: if we do it the way lincoln proposed, then the logical way to
relate those is that the segments capapbilities points to ref server.
ad: when feat returns a location is it in global or local space?
gh: lincoln - global space

ls: every annot server will know length of its landmarks (chrms).
some people will not want to be served dna, they will point somewhere
else where to get the dna. There will be many places to get dna for a
given global id, they chose one they like.
ls: feature locations are given in global id
ad: this changes the way it's been working. xml:base issues
ls: I know.
gh: if base of sequence and base of features are different, the xml
will get bigger.

ls: so an argument for having local ids is so you can make location
string shorter.
gh: yes.
ls: probably not worth it
ad: also makes it easier to set up a basic server. if you want to
overlay them, yes you do.
ls: you can always set up a local server if you

gh: segments response local and global id as we talked about yesterday
(which one feature locatn is relative to)
gh: if the only way to overlay for a client to know things are in the
same coord system is segid=xxxx and globalid=yyyy, how much harder is
it for server to use global ids.
ls: server can have configuration file to know where its global ids
are coming from

aday: would need to think about it more.
ad: who will set up these identifiers (yeast, human)
ls: I'll do it for model org databases, I will specify segments, and
their dna fetchers and will look up their lengths.
gh: versions?
ls: most recent. community can then keep it up to date.
I bet ensembl will be happy to generate this file automatically with
every build (for vertebrates)

ad: local id uri, and a bunch of synonyms. People will set up own
server not referencing a global system.
ls: then client would do a closure over all systems.
imagine three servers:
server-a says here is my segment
server-b says it can be  b or c
server-c says it can be c or a
so you have to do a join over all servers

gh: not encourage people to do that with local seq ids, encourage
people to use.
need a global referencing system to say this uri is same as that uri.
ad: bad logic for the web. If one is wrong, could be a problem
td: (proposal - based on genomic coord alignments)
ad: that says only alignable things are the same.

ad: don't think it will work, they will already have local servers

gh: what about 'the stick': people who want to register their server
with central registry can only do so if they use global ids for their
segments. 
ls, td: fine
ad: if they've been working for a while in house, they would have a
big effort to retrofit their system to comply. just won't do.

ls: in draft 3, where's assembly info?
ad: same as before. ask segments for agp format. draft not complete.
gh: the thing that ids which assembly you're on is the coordinates
element (authority, taxonomy, ...)
ls: authority is a recognized, globally unique organization. Should it
be a uri?
ad: authority and version is human visible so people can search by
it.
ls: fine.

gh: can invoke the 'stick' idea here: if you 're trying to register
something on same genomome assembly, then registry can check your
segments to verify they are agreed up.
ls: taxon, source, authority, version all must match
ad: also an id
ap: we discussed in email
ad: the only stuff that is complete is in the ucla subdir.
ls: the examples are definitive
ad: yes, unless we change things today.

ls: what if taxon, source, version match but uri doesn't?
registry gets submission. makes a segments request on submitter, if it
gets a list of same segment identifiers, it accepts it. what if it
gets a subset?
gh: ok
ls: superset is not ok.
aday: why?
gh: if you allow subset and superset, you can have everything.
aday: use case: bacteria with extra plasmid identifier.

nh: signing off. will be at affy tomorrow.

ls: you would have to create your own coord system.
gh: could argue with maintainer to added it.
ls: can you have multiple coordinates in a given assembly?
aday: proposal: make coords an attribute of the segment.
could keep your segment references local.

ls: we shouldn't give people ways to create new names. human chr1 ncbi
build 35 should be something that everybody can agree on.
gh: then we wouldn't allow allen's use case where someone wants a
superset of what's in reference?
ls: add new coord tag to source version entry, says I'm creating a
superset consisting of coords from ref 1, 2, 3, any of these can be a
new namespace that I set up.
gh: how do you know which ones come from where?
right now there's now way to get coord for a segment.
ad: can as of yesterday afternoon.

ls: to indicate which segments come from which auth. put coord id into
segments tag. 
aday: thank you!
ad: alternative proposal - multiple segments
use case: when you have scaffolds or chromosomes, or mouse and yeast
ls: say you want human mouse scaffolds + chrms, and human chrms
three diff coords tags in the sources document
each one gives auth, taxon, etc.
when client goes to get segments, it will get human chromosomes, mouse
chrms, and mouse scaffolds, in one big list, each will point back to
coord it got in features requets.

gh: knowing what coordinates doesn't tell you global id for segment
aday: ok.
gh: multiple segments elements vs mult coords in a segment work for
me.
ad: what does a client do
gh: ...
ls: three types of entry points, hu chrms, mo chrms, mo scaffolds, now
tell me what you want to start browsing. human readable.
scaffold on mouse with name xxx from two

ad: displaying all together vs one or the other or the other.

ee: affymetrix use case in igb. [probe

gh: doesn't seem to matter
aday: the tag values are easier to implement
td: not a big difference to me
gh: drawing on whiteboard...

ls: let's rename das to distributed annotation research network. then
we can say "darn1, darn2"!

ad: gregg's request for search to find everything identical (start and
end are same)
td: if you have contained and inside, you can do identical with an and
operation.
ls: doesn't make server any more complicated, for completeness you may
want to do that.
ad: how about includes 1-5000 and excludes ... some of this is asethetic.
ls: overlaps, contains, contained-in have good use cases for.
exact match - maybe searching for curated exons that exactly match
predicted. 

[Lincoln has to leave.]

gh: drawing options for segments and coordinate systems.
[whether you  put a coords tag per segment, or ome capabilities one
for each coord system]
allen's approach - one query with filter or multiple fetches

aday: uniprot example
gh: separate segments query.
ap: can we leave it out and add later if necessary?
ad: these are things that haven't been discussed in last two years
aday: uri

ad: xml namespace issue - what do we call it (see email)
gh: you pick it

ad: required syntax for entry points /das/source
gh: recommended, but not required
ad: lincoln was only one who felt strongly about it being required,
and he's not here.


gh: feature xml, every feature can have multiple locations
feaures can represent alignments (collapsed alignment tag into feature
tag)
td: like it
gh: naive user- given a feat with multip location on genome, represent as
multip locations, or parent child relations
td: don't see as a problem. using parent-child you have things to say
about child features specific to them
gh: genscan prediction,
a problem: one server can serve them up as parent child or as multiple
locations on parent

four child exons in one case
four diff locations in other case

problem is with feat filters. if yo do an overlaps query and any
children meet the condition, you have to return the parent as well and
it's parent on up. agreed?
ad: yes
gh: works fine for parent child, but for multip location situation, if
inside query fully contains only two eons, do you return parent?

td: I'd assume inside query would return both. as long as one exon is
inside the region, the parent is return. define inside as applying to
any level.
gh: so even though the transcript is not inside, you still return it?
td: using the get parent-if-get-children rule
gh: rule must apply to all of them, so you don't get transcript since
it doesn't meet the inside condition.

aday: multiple locations makes sense - just aligned mult times.
human alu feature 100,000s, do you want to create a single feature, or
just a single identifier and put it in many different locations.
ee: that is for alignments not parent-child relationship
aday: you consider location as a attribute of the object..
ee: I agree. alu is only one object, but the exon-transcript are
different
ad: would someone want to annotate the separate exons differently?
aday: you would split it off
ad: eg blast alignment, hsp is part of the conceptual alignment.
gh: in bioperl, some people will go one path, some go the other path,
so we need to figure out how to deal with it.

feat filters is clear for parent child relationship.
aday: inside and overlaps
gh: if your overlap query only grazes one child, you return the
parent. this is the only one I'm certain about.
gh: we haven't specified that the child is within bounds of parent.
with insides, we have a difference of opinion.

one exon is within, do you return it?
ad: most clients  will be doing overlaps, you are the only one doing
insides what do you want?
gh: the multiple locations muddies the issue.
if parent child rule is you only return it if parent is inside (and
recursive parent), I've already optimized for that.
For multiple locations, I can catch that and handle it.
the way I want, the behaviour of mult location will be diff than
parent child.
td: for me, the overlaps is the most important thing. Andreas just get
everything.
ad: can we delegate to gregg here for what to do in case of inside.

[A] gregg will write up description for inside query and multiple locations

Status reports
-----------------

gh: updating server. overlaps, insides, types, and each
good news: latest genome assembly on human on affy server overlayed
with allen's server. using hardcoded knowledge in igb for assembly id,
not coordinates yet.
with andrew: making sure clients can understand any variants of
namespace usage in the xml.
get client to use more capabilities like links

ad: example data set together, updated schema to latest spec, but
forgot cigar thing. update validator to use most recent version or rnc
schemas.
gh: even if your server isn't public you can cut and paste into you
validator at http://cgi.biodas.org:8080

aday: biopackages up to date with version 200 of spec file. issues for
nomi, and gregg. off by one error.

bo: small code refactor in the das server. testing that today.

ee: nothing das related yet, but will. implementing style sheets to get
colors for features.

ap: registry ui for upload of a das/2 source. coding for that

gh: what about registry rejecting segment ids if they don't match
standard ids for that coord system. sound good to you?
ap: basically yes. 
td: not done a great deal

gh: Nomi has been here working on apollo client. we'll hear from her
tomorrow. 

-----------------------
post teleconf discussion re: using global identifiers for uri

[Notetaker: just a few morsels were captured here.]

ad: most folks i work with get something going locally, then after
it's going, hook it up with the rest of the world, integrate with
other people. they don't want to revamp their work in order to do
that. 

gh: slightly in favor with andrew

ad: get what we have now. they are still uri's so it's just an
interpretation. will change attributes to be 'uri and 'reference_uri'

gh: how does it get length of segments?


ad: good idea to have coordinates and segments in the document.
add your own track to ensembl, you don't need to give it a segments,
just specify coordinates.
gh: seems like it will encourage servers that can only work with
particular clients.

ad: what about getting rid of coordinates, just needed by Andreas for
registry. 


From Steve_Chervitz at affymetrix.com  Thu Mar 16 15:38:13 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Thu, 16 Mar 2006 12:38:13 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day four, 16 Mar 2006
Message-ID: <C03F0CB5.1D0F8%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day four, 16 Mar 2006

$Id: das2-teleconf-2006-03-16.txt,v 1.1 2006/03/16 20:45:48 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Gregg Helt
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke (at Affy)
  Sanger: Andreas Prlic
  UC Berkeley: Nomi Harris (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


Status reports
---------------

nh: apollo work, reading the registry, saving
capabilties. modifications to code that was based on prototype das
adaptor. Generally lots of under the hood work to bring it up to spec.

bo: diff functionality between allen's server biopackages.net server
and andrew's samepl xml. Updated templates in allen's das server to
match andrew's sample xml.

ad: worked on validation server, all stuff is in cvs. the
http://cgi.openbio.org:8080 server is built off cvs, just check out
and rebuild. 

gh: worked on affy das2 server and client up to current spec based on
whatever the rnc documents say (schema doc) as for xml. no chance to read
andrew's email on query syntax, will incorporate that today.

sc: got latest version of gregg's das/2 server up at affy. serving
hg17, hg16, dm2. Updated code that the das1 server is using based on
latest genoviz jars. Getting some errors when loading data for new
affy arrays. Investigating.

aday: minor bug fixes for spec v200. exporting assay data as different
views.
ucsc browser can viz expression data out of das server in bed format.
das viewer can view as egr format. working on single chip at a time.

ls: here's a great use case for you: there's a cshl fellow creating dna
spectrographs of oligo frequencies presented as audiographs. can really
tell diffs from coding vs non-coding, CpG triplets, microsatellites
harmonics, big matrices of floating point data tied to genome.
consider this a challenge to das to serve this up.
my postdoc sheldon mckay is serving this up give you heatmap back
given a genomic region. new glyph for spectrographic data

aday: format netCDF is good for this, but clients out there don't
vizualize it.
gh: would like to support netCDF in igb. not sure if this is default
way to represent qualtitative data for das.

[A] allen will send lincoln pointer to netCDF.

aday: netCDF is great for cross-lang, cross platform support.
gh: people are pushing wiggle format to ucsc, so we don't want to
restrict to just netCDF.

aday: my refactor yesterday allows treatment of these as templates.
gh: how do this via region query in das?
ls: feature query, tag says here comes binary data, each column
corresponds to a base (or maybe a scaling factor to indicate # of bp
per column). 
tag says here comes binary qualtitatilve data, scale is 1:1.
gh: better way is to use alternative content format stuff (already in
spec for types)
ls: if you do feat request and don't filter by type, you'll get a mix
of binary and non binary.
aday: not in genome domain, genome/sequence the fetch to assay service
to get quant data. then do intersection to find overlap.
performance goes out window if you make the query too complex.
fine to do just two fetches.

ls: how indicate scale for numerical scale?
aday: good question. units are not encoded now.
ls: spectogrphic data one value per window where window is 100 bp
aday: so two diff units
window size, amplitude value and frequency, and that's in four
channels for the bases. we're representing as 4 matrices.
aday: one matrix per channel.many formats don't support n-dimensional
data. only 2d at most.
ls: in das1 did base64 encoded string in the notes. It worked.
gh: we can't require all clients to know how to interpret it.
This is why we have the alt content functionality...

[A] das should support dense numeric data across regions, format specified
by the existing alternative format mechanism

Topic: Spec Freeze
-------------------

ls: can we talk about feezing spec?
ad: what good will it do?
ls: allow us to code to a fixed spec. you freeze spec, people write
code for a defined period of time, during that time we compare notes,
then make changes, freeze, and repeat.
ad: concerned there hasn't been enough work since the changes in jan/feb.
ls: now that i'm 'on the other side of the fence' of spec writing,
i'd like to see it not change, and have time to make an informed view
of what it's strengths and weaknesses are.
ad: haven't gotten feedback about my questions, until the
codesprints. two months ago, only now being addressed.
ls: these issues don't become pressing until we start
implementing. this is why we do code sprints.
ad: worry because there's been no extensive data modeling for
features.
ls: can do a 1 month freeze
gh: comfortable with 1 mon freeze of schemas as they are in the rnc's
now. issues will come up.
ls: announce on biodas.org - march 18th das/2 is frozen for 1 month.
gh: we'll have to live to ambiguity with how server does certain
things.
ls: hence the time limited 'trial' freeze.
ad: would have like people to write code from last feb so I could get
feedback.
ls: you very much improved the spec. grateful for what you've done. I
wasn't getting feedback when I was writing either.
gh: validation website is great for implementers, rather than having
to read a spec document everyday.
ad: schemas aren't going to change after today (pm). would like to
clear some things up about filter language, today?
ls: most urgent freeze

[A] spec will freeze as of end of today (3/16/06, PST) for one month.

Topic: Feature filters
----------------------

ad: feature filters is most important, and how do we define global
names? schema is a simple change - which is req'd and which is
optional but for impls makes a big diff.
ls: global is req'd and local is optional.
ad: who comes up with global names
ls: first person to do it has naming rights.
people have been able to do it for the ensembl service.
ad: I need documented names
gh: it means you don't know whether two names are the same thing until
this document comes out.

ls: filter language?
ad: gregg needs inside and contains,
- type and exact type: das type or ontology type?
ls: das type
gh: uri attribute of the type
ad: that type or it's subtype makes no sense for das types
ls: it's just an exact match. client can use ontology to get a series
of types
ls: should be an exact match, does not traverse ontology.
client should ask user: do you want all exons or a specific type of
exon? 
ls: client goes through ontology as necesary

[A] drop exacttype, type now has exacttype semantics

Topic: XID, feature ids
------------------------

ad: xid in features. no one used yet. gives a ref to some other
db. all it is is a url/uri. feels like there should be more info
(type?)
ad: primary name field for feature, feels like should be name
ls: name is human readable. title would be ok
ad: but feature filter is called name searches name and id fields
ls: this is correct behavior, you can do a fetch on the url/uri
this is ok.
ad: the name feature searches title and alias.

gh: if feature id is resolvable and you resolve it, there's no
guarantee it gives back a das2xml document.
if the feature uri is resolvable, and you fetch it, you will get back
a das2xml document right?
can you put uri in the feature query?
aday: feels that having auto-generated names
ad: do all features have a human readable name?
gh/ls: optional
ad: why would you want to put a url in a name field?
gh: rdf
ad: should be a resolvable resource, das2xml for that feature.

ad: features with aliases, do aliases need type pk or accession?
prosite has false match to ...
ls: this is a property or xid, not alias
ad: suggests that xid needs extra stuff to it.
gh: file with an optional type attribute on xid
ad: let's wait to someone has a need.

Topic: Feature filters (continued)
----------------------------------

gh: feature filters, inside, contains, identical. Which do we need,
which can we drop?

[A] overlaps - keep (all agree)

inside - gregg needs
contains - dropping, maybe
identical - dropping

ad: what about excludes - the complement of overlap?
gh: haven't had time to investigate whether I can use excludes rather
than the inside + overlaps (contains?) combination I need now.

ls: use case: pointing to children and they haven't arrived yet.
gh: my client keeps stuff around, when you get parent/child if you
have parent + all children you can construct feature.
ls: the spec requires single parent, right?
gh: no you can have multiple.
ls: gff3 spec also allows mult parent and children

[A] Lincoln will provide use cases/examples of these features scenarios:
- three or greater hierarchy features
- multiple parents
- alignments


Topic: Registry 
----------------

ap: still here.
gh: looking at registry, having trouble retrieving in a normal
browser. when looking at it in client, I only see biopackages server
registered as a server. Lincoln said there was more?
ap: this is related to mime types, changed from text plain to
x-das-sources 
gh: I get an error: source file could not be red.
lincoln said you added other test das2 servers to it.
ap: working on interface so users can upload servers.
half way through it now. upload a link to sources.
will send email once it's there.

[A] Steve will add gregg's new affy das/2 server to registry when Andreas'
web interface is ready

gh: same time tomorrow.


From cjm at fruitfly.org  Thu Mar 16 15:50:37 2006
From: cjm at fruitfly.org (chris mungall)
Date: Thu, 16 Mar 2006 12:50:37 -0800
Subject: [DAS2] query language description
In-Reply-To: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID: <e3ad320577f173bc0234ca4df6d16645@fruitfly.org>

Hi Andrew

I presume one constraint is that you want to preserve standard CGI URL 
syntax? I think this is the best that can be done using that 
constraint, which is to say, fairly limited. This lacks one of the most 
important features of a real query language, composability. These 
ad-hoc constraint syntaxes have their uses but you'll eventually want 
to go beyond the limits and end up adding awkward extensions. Why not 
just forego the URL constraint and go with a composable extendable 
query language in the first place and save a lot of bother downstream?

On Mar 15, 2006, at 9:17 PM, Andrew Dalke wrote:

> The query fields are
>
>    name      |  takes | matches features ...
>   ==========================
>    xid       |  URI   | which have the given xid
>    type      |  URI   | with the given type or subtype (XX keep this
> one???)
>    exacttype |  URI   | with exactly the given type
>    segment   |  URI   | on the given segment
>    overlaps  | region | which overlap the given region
>    inside    | region | which are contained inside the given region (XX
> needed??)
>    contains  | region | which contain the given region  (XX needed?? )
>    name      | string | with a name or alias which matches the given
> string
>    prop-*    | string | with the property "*" matching the given string
>
> Queries are form-urlencoded requests.  For example, if the features
> query URL is 'http://biodas.org/features' and there is a segment named
> 'http://ncbi.org/human/Chr1' then the following is a request for all 
> the
> features on the first 10,000 bases of that segment
>
> The query is for
>      segment = 'http://ncbi.org/human/Chr1'
>      overlaps = 0:10000
>
> which is form-urlencoded as
>
>
> http://biodas.org/features?
> segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A1000
>
> Multiple search terms with the same key are OR'ed together.  The
> following
> searches for features containing the name or alias of either
> BC048328 or BC015400
>
>    http://biodas.org/features?name=BC048328;name=BC015400
>
> Multiple search terms with different keys are AND'ed together,
> but only after doing the OR search for each set of search terms with
> identical keys.  The following searches for features which have
> a name or alias of BC048328 or BC015400 and which are on the segment
> http://ncbi.org/human/Chr1
>
>
> http://biodas.org/features?name=BC048328;
> segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400
>
> The order of the search terms in the query string does not affect
> the results.
>
> If any part of a complex feature (that is, one with parents
> or parts) matches a search term then all of the parents and
> parts are returned.  (XXX Gregg -- is this correct? XXX)
>
>
> The fields which take URLs require exact matches.
>
> I think we decided that there is no type inferencing done in
> the server; it's a client side thing.  In that case the 'type'
> field goes away.  We can still keep 'exacttype'.  The URI
> used for the matching is the type uri, and NOT the ontology URI.
>
> (We don't have an ontology URI yet, and when we do we can add
> an 'ontology' query.)
>
> The segment URI must accept the local identifier.  For
> interoperability with other servers they must also accept the
> equivalent global identifier, if there is one.
>
> If range searches are given then one and only one segment is
> allowed.  Multiple segments may be given, but then ranges are not
> allowed.
>
> The string searches support a simple search language.
>      ABC  -- contains a word which exactly matches "ABC" (identity, not
> substring)
>     *ABC  -- words ending in "ABC"
>      ABC* -- words starting with "ABC"
>     *ABC* -- words containing the substring "ABC"
>
> If you want a field which exactly contains a '*' you're kinda
> out of luck.  The interpretation of whitespace in the query or
> in the search string is implementation dependent.  For that
> matter, the meaning of "word" is implementation dependent.  (Is
> *O'Malley* one word? *Lethbridge-Stewart*?)
>
> When we looked into this last month at Sanger we verified that
> all the databases could handle %substring% searches, which was
> all that people there wanted.  The Affy people want searches for
> exact word, prefix and suffix matches, as supported by the the
> back-end databases.
>
>
>    XXX CORRECT ME XXX
>
> The 'name' search searches.... It used to search the 'name'
> attribute and the 'alias' fields.  There is no 'name' now.  I
> moved it to 'title'.  I think I did the wrong thing; it should
> be 'name', but it's a name meant for people, not computers.
>
> Some features (sub-parts) don't have human-readable names so
> this field must be optional.
>
>
> The "prop-*" is a search of the <PROP> elements.  Features may
> have properties, like
>
>     <PROP key="cellular_component" value="membrane" />
>
> To do a string search for all 'membrane' cellular components,
> construct the query key by taking  the string "prop-" and
> appending the property key text ("cellular_component").  The
> query value is the text to search for.
>
>      prop-cellular_component=membrane
>
> To search for any cellular_component containing the substring "mem"
>
>      prop-cellular_component=*membrane*
>
> The rules for multiple searches with the same key also apply to the
> prop-* searches.  To search for all 'membrane' or 'nuclear'
> cellular components, use two 'prop-cellular_component' terms, as
>
>
> http://biodas.org/features?prop-cellular_component=membrane;prop-
> cellular_component=membrane
>
>
> The range searches are defined with explicit start and end
> coordinates.  The range syntax is in the form "start:end", for
> example, "1:9".
>
> Let 'min' be the smallest coordinate for a feature on a given
> segment and 'max' be one larger than the largest coordinate.
> These are the lower and upper founds for the feature.
>
> An 'overlaps' search matches if and only if
>     min < end AND max > start
>
> XXX For GREG XXX
>
> What do 'inside' and 'contains' do?  Can't we just get
> away with 'excludes', which has complement of 'overlaps'?
> Searches are done as:
>    Step 0) specify the segment
>    Step 1) do all the includes  (if none, match all features on 
> segment)
>    Step 2) do all the excludes, inverted (like an includes search)
>    Step 3) only return features which are in Step 1 but not
>        in Step 2)
>    Step 4) ...
>    Step 5) Profit!
>
> I think this will support your smart code, and it's easy
> enough to implement.
>
> Every one but you was planning to use 'overlaps'.  Only you
> wanted to use 'inside'.  Anyone want to use 'contains'?
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From dalke at dalkescientific.com  Thu Mar 16 18:24:25 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:24:25 -0800
Subject: [DAS2] 'source' attribute in the types document
Message-ID: <c9f78d43e5f0267a571095bd0b735418@dalkescientific.com>

Types have a 'source' field.

The first draft shows examples like
   source='curated'
   source='genescan'
   source='tRNAscan-SE-1.11'

My interpretation is that this is a human readable field,
with no machine interpretation other than as a string.  It
does not come from a controlled vocabulary.  It may contain
spaces.

This field is not currently searchable because we expect the
number of types to be small enough a client will download
everything and do the search locally.

Let me know if I'm wrong.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 17:46:14 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 14:46:14 -0800
Subject: [DAS2] query language description
In-Reply-To: <e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
Message-ID: <e3e550c9fe3d623420739a36073d0891@dalkescientific.com>

Hi Chris,

> I presume one constraint is that you want to preserve standard CGI URL 
> syntax?

Yes.

>  I think this is the best that can be done using that constraint, 
> which is to say, fairly limited.

Then again, the functionality we need is also fairly limited.

>  This lacks one of the most important features of a real query 
> language, composability. These ad-hoc constraint syntaxes have their 
> uses but you'll eventually want to go beyond the limits and end up 
> adding awkward extensions. Why not just forego the URL constraint and 
> go with a composable extendable query language in the first place and 
> save a lot of bother downstream?

Because no one can decide on a generic language which is more
powerful than this.

Anything more powerful would need to support .. boolean algebra?
numeric searches?  regexps?  What about quoting rules for "multiple
word phrases"?

Is it SQL-like?  XPath/XQuery-like?  Is it a context-free grammar?
How easy is it to implement and work cross-platform?

For what people need now, this search solution seems good.

For the future we can have

   <CAPABILITY type="xpath-query" query_uri="http://whatever" />

and clients which understand that interface will know that it's
there.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 18:38:07 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:38:07 -0800
Subject: [DAS2] new search terms
Message-ID: <5a29cf88a8fc1e8e8448c6e1dd248dbb@dalkescientific.com>

"note=" is a string search of the note fields

   Example: note=And*
     finds all features where which have a note containing
     a word starting with 'And'

"coordinates=" filters for features on that coordinate system.
  (We talked about this one yesterday.)

I'm republish the search terms before the end of the day.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 18:54:12 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:54:12 -0800
Subject: [DAS2] comments in schema
Message-ID: <e3731b48f7b05d690ec4fe1e873c482d@dalkescientific.com>

I've updated the schema docs (das/das2/draft3/*.rnc )
to include more detailed comments.

Also, updated the ucla examples to change 'synonym' to
'reference'.

Everything should be up to date.


					Andrew
					dalke at dalkescientific.com


From cjm at fruitfly.org  Thu Mar 16 19:04:03 2006
From: cjm at fruitfly.org (chris mungall)
Date: Thu, 16 Mar 2006 16:04:03 -0800
Subject: [DAS2] query language description
In-Reply-To: <e3e550c9fe3d623420739a36073d0891@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
	<e3e550c9fe3d623420739a36073d0891@dalkescientific.com>
Message-ID: <8b7582943da22dfed23ba7b5386402fb@fruitfly.org>


On Mar 16, 2006, at 2:46 PM, Andrew Dalke wrote:

> Hi Chris,
>
>> I presume one constraint is that you want to preserve standard CGI URL
>> syntax?
>
> Yes.

I'm guessing you've been through this debate before, so no comment..

>
>>  I think this is the best that can be done using that constraint,
>> which is to say, fairly limited.
>
> Then again, the functionality we need is also fairly limited.

ignorant question.. (I have only been tangentially aware of the outer 
edges of the whole das2 process)..

how are you determining the functionality required? surely someone 
somewhere will want to write a das2 client that implements boolean 
queries

I speak from experience - I designed the GO Database API to have a very 
similar constraint language (it's expressed using perl hash keys rather 
than CGI parameters but the same basic idea). For years people have 
been clamouring for the ability to do more complex queries - right now 
they are forced bypass the constraint language and go direct to SQL.

>
>>  This lacks one of the most important features of a real query
>> language, composability. These ad-hoc constraint syntaxes have their
>> uses but you'll eventually want to go beyond the limits and end up
>> adding awkward extensions. Why not just forego the URL constraint and
>> go with a composable extendable query language in the first place and
>> save a lot of bother downstream?
>
> Because no one can decide on a generic language which is more
> powerful than this.
>
> Anything more powerful would need to support .. boolean algebra?
> numeric searches?  regexps?  What about quoting rules for "multiple
> word phrases"?
>
> Is it SQL-like?  XPath/XQuery-like?  Is it a context-free grammar?
> How easy is it to implement and work cross-platform?

None of these really lit into the DAS paradigm. I'm guessing you want 
something simple that can be used as easily as an API with get-by-X 
methods but will seamlessly blend into something more powerful. I think 
what you have is on the right lines. I'm just arguing to make this 
language composable from the outset, so that it can be extended to 
whatever expressivity is required in the future, without bolting on a 
new query system that's incompatible with the existing one.

The generic language could just be some kind of simple extensible 
function syntax for search terms, boolean operators, and some kind of 
(optional) nesting syntax.

If you have boolean operators and it's composable, then yep it does 
have to be as expressive as boolean algebra.

I'd argue that implementing a composable query language is easier than 
an ad-hoc one

> For what people need now, this search solution seems good.
>
> For the future we can have
>
>    <CAPABILITY type="xpath-query" query_uri="http://whatever" />
>
> and clients which understand that interface will know that it's
> there.

hmm, not sure how useful this would be - surely you'd want something 
more dasmodel-aware?

if you're going to just pass-through to xpath or sql then why have a 
das protocol at all?

>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From Gregg_Helt at affymetrix.com  Thu Mar 16 19:22:54 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Thu, 16 Mar 2006 16:22:54 -0800
Subject: [DAS2] query language description
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA3B@msex02.affymetrix.com>

For the type query filter, I'd suggest keeping the exacttype semantics
you discuss below, but using "type" for the field name rather than
"exacttype".  If we're getting rid of one of them, and a non-exact type
is a meaningless concept, it seems like keeping that "exact" part is
unnecessary and potentially confusing.

	gregg
> 
> I think we decided that there is no type inferencing done in
> the server; it's a client side thing.  In that case the 'type'
> field goes away.  We can still keep 'exacttype'.  The URI
> used for the matching is the type uri, and NOT the ontology URI.
> 
> (We don't have an ontology URI yet, and when we do we can add
> an 'ontology' query.)
> 
> The segment URI must accept the local identifier.  For
> interoperability with other servers they must also accept the
> equivalent global identifier, if there is one.
> 
> If range searches are given then one and only one segment is
> allowed.  Multiple segments may be given, but then ranges are not
> allowed.
> 
> The string searches support a simple search language.
>      ABC  -- contains a word which exactly matches "ABC" (identity,
not
> substring)
>     *ABC  -- words ending in "ABC"
>      ABC* -- words starting with "ABC"
>     *ABC* -- words containing the substring "ABC"
> 
> If you want a field which exactly contains a '*' you're kinda
> out of luck.  The interpretation of whitespace in the query or
> in the search string is implementation dependent.  For that
> matter, the meaning of "word" is implementation dependent.  (Is
> *O'Malley* one word? *Lethbridge-Stewart*?)
> 
> When we looked into this last month at Sanger we verified that
> all the databases could handle %substring% searches, which was
> all that people there wanted.  The Affy people want searches for
> exact word, prefix and suffix matches, as supported by the the
> back-end databases.
> 
> 
>    XXX CORRECT ME XXX
> 
> The 'name' search searches.... It used to search the 'name'
> attribute and the 'alias' fields.  There is no 'name' now.  I
> moved it to 'title'.  I think I did the wrong thing; it should
> be 'name', but it's a name meant for people, not computers.
> 
> Some features (sub-parts) don't have human-readable names so
> this field must be optional.
> 
> 
> The "prop-*" is a search of the <PROP> elements.  Features may
> have properties, like
> 
>     <PROP key="cellular_component" value="membrane" />
> 
> To do a string search for all 'membrane' cellular components,
> construct the query key by taking  the string "prop-" and
> appending the property key text ("cellular_component").  The
> query value is the text to search for.
> 
>      prop-cellular_component=membrane
> 
> To search for any cellular_component containing the substring "mem"
> 
>      prop-cellular_component=*membrane*
> 
> The rules for multiple searches with the same key also apply to the
> prop-* searches.  To search for all 'membrane' or 'nuclear'
> cellular components, use two 'prop-cellular_component' terms, as
> 
> 
> http://biodas.org/features?prop-cellular_component=membrane;prop-
> cellular_component=membrane
> 
> 
> The range searches are defined with explicit start and end
> coordinates.  The range syntax is in the form "start:end", for
> example, "1:9".
> 
> Let 'min' be the smallest coordinate for a feature on a given
> segment and 'max' be one larger than the largest coordinate.
> These are the lower and upper founds for the feature.
> 
> An 'overlaps' search matches if and only if
>     min < end AND max > start
> 
> XXX For GREG XXX
> 
> What do 'inside' and 'contains' do?  Can't we just get
> away with 'excludes', which has complement of 'overlaps'?
> Searches are done as:
>    Step 0) specify the segment
>    Step 1) do all the includes  (if none, match all features on
segment)
>    Step 2) do all the excludes, inverted (like an includes search)
>    Step 3) only return features which are in Step 1 but not
>        in Step 2)
>    Step 4) ...
>    Step 5) Profit!
> 
> I think this will support your smart code, and it's easy
> enough to implement.
> 
> Every one but you was planning to use 'overlaps'.  Only you
> wanted to use 'inside'.  Anyone want to use 'contains'?
> 
> 					Andrew
> 					dalke at dalkescientific.com
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From dalke at dalkescientific.com  Thu Mar 16 21:05:06 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 18:05:06 -0800
Subject: [DAS2] query language description
In-Reply-To: <8b7582943da22dfed23ba7b5386402fb@fruitfly.org>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
	<e3e550c9fe3d623420739a36073d0891@dalkescientific.com>
	<8b7582943da22dfed23ba7b5386402fb@fruitfly.org>
Message-ID: <c4433b247b29525254354103b60ce414@dalkescientific.com>

Chris:
> ignorant question.. (I have only been tangentially aware of the outer 
> edges of the whole das2 process)..
>
> how are you determining the functionality required? surely someone 
> somewhere will want to write a das2 client that implements boolean 
> queries

It was informal, based on feedback from client developers and 
maintainers.
Lincoln, Thomas, Andreas, Gregg and others provided that feedback.
It was not by talking with users.

I know there's a wide range of users and use cases.  The point
of this query language is to have basic functionality that all
servers can implement.

> right now they are forced bypass the constraint language and go direct 
> to SQL.

In addition, we provide defined ways for a server to indicate
that there are additional ways to query the server.

> None of these really lit into the DAS paradigm. I'm guessing you want 
> something simple that can be used as easily as an API with get-by-X 
> methods but will seamlessly blend into something more powerful. I 
> think what you have is on the right lines. I'm just arguing to make 
> this language composable from the outset, so that it can be extended 
> to whatever expressivity is required in the future, without bolting on 
> a new query system that's incompatible with the existing one.

We have two ways to compose the system.  If the simple query language
is extended, for example, to support word searches of the text field
instead of substring searches, then a server can say

<CAPABILITY type="features" 
query_uri="http://somewhere.over.rainbow/server.cgi">
   <SUPPORTS name="word-search"/>
</CAPABILITY>

This is backwards compatible, so the normal DAS queries work.  But
a client can recognize the new feature and support whatever new filters
that 'word-search' indicates, eg

   http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*

(finds features with notes containing words starting with 'Andre' )

These are composable.  For example, suppose Sanger allows modification
date searches of curation events.  Then it might say

<CAPABILITY type="features" 
query_uri="http://somewhere.over.rainbow/server.cgi">
   <SUPPORTS name="word-search"/>
   <SUPPORTS name="sanger-curation"/>
</CAPABILITY>

and I can search for notes containing words starting with "Andre"
which were modified by "dalke" between 2002 and 2005 by doing

   http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*&
        modified-by=dalke&modified-before=2005&modified-after=2002


An advantage to the simple boolean logic of the current system
is that the GUI interface is easy, and in line with existing
simple search systems.


If someone wants to implement a new search system which is
not backwards compatible then the server can indicate that
alternative with a new CAPABILITY.  Suppose Thomas at Sanger
comes up with a new search mechanism based on an object query
language he invented,

<CAPABILITY type="down-oql"
     query_uri="http://sanger.ac.uk/oql-search" />

The Sanger and EBI clients might understand that and support
a more complex GUI, eg, with a text box interface.  Everyone
else must ignore unknown capability types.

Then that would be POSTED (or whatever the protocol defines)
to the given URL, which returns back whatever results are
desired.

Or the server can point to a public MySQL port, like

<CAPABILITY type="mysql-connection"
     query_uri="mysql://username:password at hostname:port/databasename" />

That's what you are doing to bypass the syntax, except that
here it isn't a bypass; you can define the new interface in
the DAS sources document.

> The generic language could just be some kind of simple
> extensible function syntax for search terms, boolean operators,
> and some kind of (optional) nesting syntax.

Which syntax?  Is it supposed to be easy for people to write?
Text oriented?  Or tree structured, like XML, or SQL-like?
And which clients and servers will implement that search
language?

If there was a generic language it would allow
   OR("on segment Chr1 between 1000 and 2000",
      "on segment ChrX between 99 and 777")
which is something we are expressly not allowing in DAS2
queries.  It doesn't make sense for the target applications
and by excluding it it simplifies the server development,
which means less chance for bugs.

Also, I personally haven't figured out a decent way to
do a GUI composition of a complex boolean query which is
as easy as learning the query language in the first place.

A more generic language implementation is a lot of overhead
if most (80%? 90%) need basic searches, and many of the
rest can fake it by breaking a request into parts and
doing the boolean logic on the client side.

Feedback I've heard so far is that DAS1 queries were
acceptable, with only a few new search fields needed.

> hmm, not sure how useful this would be - surely you'd want something
> more dasmodel-aware?

The example I gave was a bad one.  What I meant was to show
how there's an extension point so someone can develop a new
search interface and clients can know that the new functionality
exists, without having to change the DAS spec.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 23:47:58 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 20:47:58 -0800
Subject: [DAS2] query language description
In-Reply-To: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID: <a27d1b1df9baf83885ce73513ed70b96@dalkescientific.com>

Updated:
   - added 'note' as a query field
   - changed string searches to substring (not word) searches
        and made them be case insensitive

       "AB" matches only the strings "AB", "Ab", "aB" and "ab"
       "*AB" matches only fields which exactly end with
               "AB", "ab", "aB", and "Ab"
       "AB*" matches only fields which exactly match, up to case
       "*AB*" matches only fields which contain the substring,
             up to case

   - added 'coordinates' search

   - removed 'type' and renamed 'exacttype' to 'type'

   - removed 'contains' search, which no one said they wanted.  Instead,
      supporting (EXPERIMENTAL) an 'excludes' search.


==================================

The query fields are

   name      |  takes | matches features ...
  ==========================
   xid       |  URI   | which have the given xid
   type      |  URI   | with exactly the given type
   segment   |  URI   | on the given segment
coordinates |  URI   | which are part of the given coordinate system
   overlaps  | region | which overlap the given region
   excludes  | region | which have no overlap to the given region
   inside    | region | which are contained inside the given region
   name      | string | with a title or alias which matches the given  
string
   note      | string | with a note which matches the given string
   prop-*    | string | with the property "*" matching the given string

Queries are form-urlencoded requests.  For example, if the features
query URL is 'http://biodas.org/features' and there is a segment named
'http://ncbi.org/human/Chr1' then the following is a request for all the
features on the first 10,000 bases of that segment

The query is for
     segment = 'http://ncbi.org/human/Chr1'
     overlaps = 0:10000

which is form-urlencoded as

    
http://biodas.org/features? 
segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A1000

Multiple search terms with the same key are OR'ed together.  The  
following
searches for features containing the name or alias of either
BC048328 or BC015400

   http://biodas.org/features?name=BC048328;name=BC015400

The 'excludes' search is an exception.  See below.

Multiple search terms with different keys are AND'ed together,
but only after doing the OR search for each set of search terms with
identical keys.  The following searches for features which have
a name or alias of BC048328 or BC015400 and which are on the segment
http://ncbi.org/human/Chr1

    
http://biodas.org/features?name=BC048328; 
segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400

The order of the search terms in the query string does not affect
the results.

If any part of a complex feature (that is, one with parents
or parts) matches a search term then all of the parents and
parts are returned.  (XXX Gregg -- is this correct? XXX)


The fields which take URLs require exact matches, that is, a
character by character match.  (For details on the nuances of
comparing URIs see http://www.textuality.com/tag/uri-comp-3.html )

(We don't have an ontology URI yet, and when we do we can add
an 'ontology' query.)

The segment query filter takes a URI.  This must accept
the segment URI and, if known to the server, the equivalent
reference identifier for the segment.

If range searches are given then one and only one segment
must be given.  If there are multiple segment queries then
ranges are not allowed.

The string searches may be exact matches, substring, prefix
or suffix searches.  The query type depends on if the search
value starts and/or ends with a '*'.

     ABC  -- field exactly matches "ABC"
    *ABC  -- field ends with "ABC"
     ABC* -- field starts with "ABC"
    *ABC* -- field contains the substring "ABC"

The "*" has no special meaning except at the start or end
of the query value.  The search term "***" will match a
field which contains the character "*" anywhere.  There
is no way to match fields which exactly match '*' or
which only start or end with that character.

Text searches are case-insensitive.  The string "ABC"
matches "abc", "aBc", "ABC", etc.

A server may choose to collapse multiple whitespace
characters into a single space character for search purposes.
For example, the query "*a newline*" should match

   "This is a line of text which contains a
    newline"


The 'name' search does a text search of the 'title' and 'alias'
fields.


The "prop-*" is shorthand for a class of text searches of
<PROP> elements.  Features may have properties, like

    <PROP key="cellular_component" value="membrane" />

To do a string search for all 'membrane' cellular components,
construct the query key by taking  the string "prop-" and
appending the property key text ("cellular_component").  The
query value is the text to search for, in this case:

     prop-cellular_component=membrane

To search for any cellular_component containing the substring
"membrane"

     prop-cellular_component=*membrane*

The rules for multiple searches with the same key also apply to the
prop-* searches.  To search for all 'membrane' or 'nuclear'
cellular components, use two 'prop-cellular_component' terms, as

      
http://biodas.org/features?prop-cellular_component=membrane;prop- 
cellular_component=membrane


The range searches are defined with explicit start and end
coordinates.  The range syntax is in the form "start:end", for
example, "1:9".  There is no way to restrict the search to
a specific strand.

A feature may have several locations.  An annotation may
have several features in a parent/part relationship.  The
relationship may have several levels.  If a range search
matches any feature in the annotation then the search
returns all of the features in the annotation.

An 'overlaps' search matches if and only if any feature
location of any of the parent or part overlaps the query
range and segment.

An 'inside' search matches if and only if at least one
feature in the annotation has a location on the query segment
and all features which have a location on the query segment
have at least one location which starts and ends in the
query range.

EXPERIMENTAL: An 'excludes' matches if and only if at
least one feature of the annotation is on the query segment
and no features are in the query range.  This is the
complement of the 'overlaps' search, for annotations on
the same query segment.

Unlike the other search keys, if there multiple 'excludes'
searches then the results are AND'ed together.  That is,
if the query is has two excludes ranges
    segment=ChrX excludes=RANGE1 excludes=RANGE2
then the result are those features which on ChrX which
are not in RANGE1 and are not in RANGE2.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Fri Mar 17 02:05:54 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 23:05:54 -0800
Subject: [DAS2] alternate formats
Message-ID: <3f895441c38b74460da9f8e4582b7a74@dalkescientific.com>

If you've read the updated schema definitions you saw
I've added the following comment in the CAPABILITY

   # Format names which can be passed to the query_uri.
   # The names are type dependent.  At present the
   # only reserved names are for the 'features' capability.
   # These are: das2xml, count, uris
   format*,


We talked about this in the UK I think, and I mentioned
it to people here.  The 'count' format returns the count
of features which would be returned for a given query.

This is a single line containing the integer followed by
a newline.  The content-type of the document is text/plain .

For example, to get the number of all the features on
the server

Request:

http://www.example.com/das2/mus/v22/features?format=count

Response:

Content-Type: text/plain

129254


I will add this format description to the spec.


When does the server need to declare that it implements
a given document type?  My thought is that if the format
list is not specified then the server must implement
'das2xml' and 'count' formats.  If it doesn't implement
the 'count' format then it needs to declare the complete
list of what it does support.


In addition I'll describe here the 'uris' format.  It is
a document of content-type text/plain containing the
matching feature URIs, one per line.  For example,

file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.0.A1_3p_a_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.0.A1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.1.S1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.2.S1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.3.S1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P:Hs.271468.0.S1_3p_at


(I feel like it should implement an xml:base scheme to reduce
the amount of traffic.)

The idea is that a client can request the URIs only, eg,
to do more complex boolean-esque searches by doing simpler
ones on the server and combining the results in client space.
For another example, if the client already knows the feature
data for a URI then it doesn't need to download the data again.
So it gets a list of URIs then only fetches the ones it
does not know about.

This requires HTTP/1.1 pipelining for good performance.

Because there are no clients which want it, because I'm not
certain on the format, and because of the lack of pipelining
in the existing servers, I will not document this format.  I'll
just leave it as a reserved format name.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Fri Mar 17 02:33:44 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 23:33:44 -0800
Subject: [DAS2] debugging validation proxy
Message-ID: <c5545a0079e981c657e86d60057859a4@dalkescientific.com>

After a conversation with Gregg this afternoon I this evening
implemented a debugging validation proxy for DAS.  The code
is about 100 lines long and combines Python's "twisted" network
library and the dasypus validator.

To make it work, configure your DAS client to use a proxy,
which is this validation proxy.  Then do things like normal.
The request go through the proxy.  It dumps the request
info to stdout and forwards the request to the real server.

It requires the response headers and body.  When finished
it passed the data to dasypus.

I stuck some DAS-ish XML on my company web server
and did the connection like this

% curl -x localhost:8080 http://www.dalkescientific.com/sources.xml

The output from the debug window is

Making request for 'http://www.dalkescientific.com/sources.xml'
Warning: Unknown Content-Type 'application/xml'.
Info: Assuming doctype of 'sources' based on root element at byte 40, 
line 2, column 2
Finished processing


					Andrew
					dalke at dalkescientific.com


From allenday at ucla.edu  Thu Mar 16 13:27:56 2006
From: allenday at ucla.edu (Allen Day)
Date: Thu, 16 Mar 2006 10:27:56 -0800 (PST)
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <200603151046.43196.lstein@cshl.edu>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151046.43196.lstein@cshl.edu>
Message-ID: <Pine.LNX.4.58.0603161027260.30576@sumo.ctrl.ucla.edu>

Hi Lincoln,

Please just code to what is there, and expect your code to break when I 
update the biopackages server to v300 (probably next week).

-Allen

On Wed, 15 Mar 2006, Lincoln Stein wrote:

> Hi Folks,
> 
> I just ran through the source request on biopackages.net and it is returning 
> something that is very different from the current spec (CVS updated as of 
> this morning UK time). I understand why there is a discrepancy, but for the 
> purposes of the code sprint, should I code to what the spec says or to what 
> biopackages.net returns? It is much more fun for me to code to a working 
> server because I have the opportunity to watch my code run.
> 
> Best,
> 
> Lincoln
> 
> 


From Gregg_Helt at affymetrix.com  Fri Mar 17 03:22:12 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Fri, 17 Mar 2006 00:22:12 -0800
Subject: [DAS2] New affymetrix das/2 development server
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA3D@msex02.affymetrix.com>


I checked in a new version of the Affymetrix DAS/2 server this evening
that supports XML responses based on the latest DAS/2 spec, version 300.
For sample sources, segments, types, and features responses it passes
the Dasypus validator tests.  The validator was _very_ useful for
bringing the server up to the current spec!  Steve rolled the new
version out on our public test server, the root sources query URL is
http://205.217.46.81:9091/das2/genome/sequence.  In the latest version
of IGB checked into CVS, this server can be accessed as "Affy-temp" in
the list of DAS/2 servers.

Although the server's XML responses conform to spec v.300, the query
strings it recognizes still only conform to a subset of spec v.200.  I
expect to have the queries upgraded to v.300 tonight.  But it will
probably still only support a subset of the query filters: one type
(required), one overlaps (required), one inside (optional).

This server also supports bed, psl, and some binary formats as
alternative content formats, depending on the type of the annotations.

	gregg

> -----Original Message-----
> From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-
> bio.org] On Behalf Of Steve Chervitz
> Sent: Wednesday, March 15, 2006 1:25 PM
> To: DAS/2
> Subject: [DAS2] New affymetrix das/2 development server
> 
> 
> Gregg's latest spec-compliant, but still development-grade, das/2
server
> is
> now publically available via http://205.217.46.81:9091
> 
> It's currently serving annotations from the following assemblies:
> - human hg16
> - human hg17
> - drosophila dm2
> 
> Send me requests for any other data sources that would help your
> development
> efforts.
> 
> Example query to get back a das-source xml document:
> http://205.217.46.81:9091/das2/genome/sequence
> 
> It's compliance with the spec is steadily improving, on a daily if not
> hourly basis during the code sprint.
> 
> Within IGB you can access this server from the DAS/2 servers tab
> under 'Affy-temp'.
> 
> You'll need the latest version of IGB from the CVS repository at
> http://sf.net/projects/genoviz
> 
> Steve
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From dalke at dalkescientific.com  Fri Mar 17 11:09:44 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 17 Mar 2006 08:09:44 -0800
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <Pine.LNX.4.58.0603161027260.30576@sumo.ctrl.ucla.edu>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151046.43196.lstein@cshl.edu>
	<Pine.LNX.4.58.0603161027260.30576@sumo.ctrl.ucla.edu>
Message-ID: <e5e618da98f01b9a6cbc2b97cc8e34e1@dalkescientific.com>

Allen:
> Please just code to what is there, and expect your code to break when I
> update the biopackages server to v300 (probably next week).

So you all know, "300" is what we've been calling the current
version of the spec, based on the code freeze that started 8
hours ago.  It's the one currently only described in the schema
definitions and in the example files under das/das2/draft3.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Fri Mar 17 11:40:20 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 17 Mar 2006 08:40:20 -0800
Subject: [DAS2] proxies, caching and network configuration
Message-ID: <58f16cd7fac095a708fd81a5cc5e40df@dalkescientific.com>

I'm writing to encourage DAS client authors to include
support for proxies when fetching DAS URLs.

Nomi pointed out that Apollo supports proxies, because
users asked for it.  I think it's because some sites
don't have direct access to the internet.  I know a few
of my clients have internal networks set up that way.

Yesterday we talked a bit about how to point to local
mirrors.  It would be hard to have a standard configuration
so that all DAS client code can know about local mirrors.
I mentioned setting up proxies, but dismissed the idea.

Now I'm thinking that that might be the solution.  If
there are local ways to get, say, sequence data then
that could be done at the proxy level.  Someone can easily
(with less than 100 lines of code) write a new proxy
server which points to a local resource if it knows that
a URI is resolvable that way.

Having proxy support also helps with debugging, like in
the debugging proxy server I wrote yesterday.

A nice thing is that some people want proxy support
anyway, so if client code supports proxies then these
other things (redirection to local mirrors, debugging)
can be set up later, and with no extra work in the client.

					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Fri Mar 17 13:47:51 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Fri, 17 Mar 2006 10:47:51 -0800
Subject: [DAS2] New affymetrix das/2 development server
In-Reply-To: <C03DC62B.1D090%Steve_Chervitz@affymetrix.com>
Message-ID: <C0404457.1D185%Steve_Chervitz@affymetrix.com>


The affy das/2 development server at http://205.217.46.81:9091 has been
updated to better support DAS/2 spec version 300.

Gregg says:
> Changed genometry DAS/2 server so that it responds to feature queries that use
> DAS/2 v.300 feature filters.  Currently implements a subset of
> the v.300 feature query spec:
>     requires one and only one segment filter
>     requires one and only one type filter
>     accepts zero or one inside filter
> Also attempts to support DAS/2 v.200 feature filters, but success is not
> guaranteed.

Steve 

> From: Steve Chervitz <Steve_Chervitz at affymetrix.com>
> Date: Wed, 15 Mar 2006 13:24:59 -0800
> To: DAS/2 <das2 at lists.open-bio.org>
> Conversation: New affymetrix das/2 development server
> Subject: New affymetrix das/2 development server
> 
> 
> Gregg's latest spec-compliant, but still development-grade, das/2 server is
> now publically available via http://205.217.46.81:9091
> 
> It's currently serving annotations from the following assemblies:
> - human hg16 
> - human hg17 
> - drosophila dm2
> 
> Send me requests for any other data sources that would help your development
> efforts.
> 
> Example query to get back a das-source xml document:
> http://205.217.46.81:9091/das2/genome/sequence
> 
> It's compliance with the spec is steadily improving, on a daily if not hourly
> basis during the code sprint.
> 
> Within IGB you can access this server from the DAS/2 servers tab
> under 'Affy-temp'.
> 
> You'll need the latest version of IGB from the CVS repository at
> http://sf.net/projects/genoviz
> 
> Steve


From dalke at dalkescientific.com  Fri Mar 17 15:09:42 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 17 Mar 2006 12:09:42 -0800
Subject: [DAS2] defined minimum limits
Message-ID: <dccc4a7c5783153b73b7537073df36e7@dalkescientific.com>

We should define minimum sizes for fields in the server database.
For example, "the server must support feature titles of at least
40 characters", "must handle at least two 'excludes' feature filters".

And define what do to when the server decides that writeback of
a 30MB feature is just a bit too large.

					Andrew
					dalke at dalkescientific.com


From boconnor at ucla.edu  Fri Mar 17 18:23:09 2006
From: boconnor at ucla.edu (Brian O'Connor)
Date: Fri, 17 Mar 2006 15:23:09 -0800
Subject: [DAS2] das.biopackages.net Updated to Spec 300
Message-ID: <441B44DD.5010505@ucla.edu>

Hi,

So I checked in my changes to the DAS/2 server which should bring it up 
to the 300 spec.  Allen updated the das.biopackages.net server and I 
tested the following URLs in Andrew's validation app.  They all appear 
to be OK:

* http://das.biopackages.net/das/genome
* http://das.biopackages.net/das/genome/yeast
* http://das.biopackages.net/das/genome/human
* http://das.biopackages.net/das/genome/yeast/S228C
* http://das.biopackages.net/das/genome/human/17
* http://das.biopackages.net/das/genome/yeast/S228C/segment
* http://das.biopackages.net/das/genome/human/17/segment
* http://das.biopackages.net/das/genome/yeast/S228C/type
* http://das.biopackages.net/das/genome/human/17/type
* 
http://das.biopackages.net/das/genome/yeast/S228C/feature?overlaps=chrI/1:1000
* 
http://das.biopackages.net/das/genome/human/17/feature?overlaps=chr1/1000:2000

Let Allen or I know if you run into problems.

--Brian


From cjm at fruitfly.org  Fri Mar 17 19:20:14 2006
From: cjm at fruitfly.org (chris mungall)
Date: Fri, 17 Mar 2006 16:20:14 -0800
Subject: [DAS2] query language description
In-Reply-To: <c4433b247b29525254354103b60ce414@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
	<e3e550c9fe3d623420739a36073d0891@dalkescientific.com>
	<8b7582943da22dfed23ba7b5386402fb@fruitfly.org>
	<c4433b247b29525254354103b60ce414@dalkescientific.com>
Message-ID: <da80d7e4da69801f0a7b2b210fc66595@fruitfly.org>


On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote:

>> right now they are forced bypass the constraint language and go direct
>> to SQL.
>
> In addition, we provide defined ways for a server to indicate
> that there are additional ways to query the server.

I was positing this as a bad feature, not a good one. or at least a 
symptom of an incorrectly designed system (at least in the case of the 
GO DB API - it may not carry forward to DAS - though if you're going to 
allow querying by terms...)

>
>> None of these really lit into the DAS paradigm. I'm guessing you want
>> something simple that can be used as easily as an API with get-by-X
>> methods but will seamlessly blend into something more powerful. I
>> think what you have is on the right lines. I'm just arguing to make
>> this language composable from the outset, so that it can be extended
>> to whatever expressivity is required in the future, without bolting on
>> a new query system that's incompatible with the existing one.
>
> We have two ways to compose the system.  If the simple query language
> is extended, for example, to support word searches of the text field
> instead of substring searches, then a server can say
>
> <CAPABILITY type="features"
> query_uri="http://somewhere.over.rainbow/server.cgi">
>    <SUPPORTS name="word-search"/>
> </CAPABILITY>
>
> This is backwards compatible, so the normal DAS queries work.  But
> a client can recognize the new feature and support whatever new filters
> that 'word-search' indicates, eg
>
>    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*
>
> (finds features with notes containing words starting with 'Andre' )
>
> These are composable.  For example, suppose Sanger allows modification
> date searches of curation events.  Then it might say
>
> <CAPABILITY type="features"
> query_uri="http://somewhere.over.rainbow/server.cgi">
>    <SUPPORTS name="word-search"/>
>    <SUPPORTS name="sanger-curation"/>
> </CAPABILITY>

so this is limited to single-argument search functions?

>
> and I can search for notes containing words starting with "Andre"
> which were modified by "dalke" between 2002 and 2005 by doing
>
>    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*&
>         modified-by=dalke&modified-before=2005&modified-after=2002

but the compositionality is always associative since the CGI parameter 
constraint forbids nesting

> An advantage to the simple boolean logic of the current system
> is that the GUI interface is easy, and in line with existing
> simple search systems.

there's nothing preventing you from implementing a simple GUI on top of 
an expressive system - there is nothing forcing you to use the 
expressivity

> If someone wants to implement a new search system which is
> not backwards compatible then the server can indicate that
> alternative with a new CAPABILITY.  Suppose Thomas at Sanger
> comes up with a new search mechanism based on an object query
> language he invented,
>
> <CAPABILITY type="down-oql"
>      query_uri="http://sanger.ac.uk/oql-search" />
>
> The Sanger and EBI clients might understand that and support
> a more complex GUI, eg, with a text box interface.  Everyone
> else must ignore unknown capability types.

but this doesn't integrate with the existing query system

>
> Then that would be POSTED (or whatever the protocol defines)
> to the given URL, which returns back whatever results are
> desired.
>
> Or the server can point to a public MySQL port, like
>
> <CAPABILITY type="mysql-connection"
>      query_uri="mysql://username:password at hostname:port/databasename" 
> />
>
> That's what you are doing to bypass the syntax, except that
> here it isn't a bypass; you can define the new interface in
> the DAS sources document.
>
>> The generic language could just be some kind of simple
>> extensible function syntax for search terms, boolean operators,
>> and some kind of (optional) nesting syntax.
>
> Which syntax?  Is it supposed to be easy for people to write?
> Text oriented?  Or tree structured, like XML, or SQL-like?

I'd favour some concrete asbtract syntax that looks much like the 
existing DAS QL

> And which clients and servers will implement that search
> language?

all servers. clients optional

>
> If there was a generic language it would allow
>    OR("on segment Chr1 between 1000 and 2000",
>       "on segment ChrX between 99 and 777")
> which is something we are expressly not allowing in DAS2
> queries.  It doesn't make sense for the target applications
> and by excluding it it simplifies the server development,
> which means less chance for bugs.

this example is pointless but it's easy to imagine plenty of ontology 
term queries or other queries in which this would be useful

I guess I depart from the normal DAS philosophy - I don't see this 
being a high barrier for entry for servers, if they're not up to this 
it'll probably be a buggy hacky server anyway

> Also, I personally haven't figured out a decent way to
> do a GUI composition of a complex boolean query which is
> as easy as learning the query language in the first place.

doesn't mean it doesn't exist.

i'm not sure what's hard about having say, a clipboard of favourite 
queries, then allowing some kind of drag-and-drop composition

> A more generic language implementation is a lot of overhead
> if most (80%? 90%) need basic searches, and many of the
> rest can fake it by breaking a request into parts and
> doing the boolean logic on the client side.

this is always an option - if the user doesn't mind the additional 
possibly very high overhead. it's just a little bit of a depressing 
approach, as if Codd's seminal paper from 1970 or whenever it was never 
happened.

> Feedback I've heard so far is that DAS1 queries were
> acceptable, with only a few new search fields needed.
>
>> hmm, not sure how useful this would be - surely you'd want something
>> more dasmodel-aware?
>
> The example I gave was a bad one.  What I meant was to show
> how there's an extension point so someone can develop a new
> search interface and clients can know that the new functionality
> exists, without having to change the DAS spec.

ok

that's probably all I've got to say on the matter, sorry for being 
irksome. I guess I'm fundamentally missing something, that is, why wrap 
simple and expressive declarative query languages with limited ad-hoc 
constraint systems with consciously limited expressivity and limited 
means of extensibility..

cheers
chris

>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From Steve_Chervitz at affymetrix.com  Sun Mar 19 23:54:36 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Sun, 19 Mar 2006 20:54:36 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day five, 17 Mar 2006
Message-ID: <C043758C.1D23A%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day five, 17 Mar 2006

$Id: das2-teleconf-2006-03-17.txt,v 1.2 2006/03/20 05:05:22 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Dalke Scientific: Andrew Dalke (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 

Agenda: 

* Status reports
* Writeback progress


Status reports:
---------------

gh: This is the last mtg of code sprint. For the status reports, focus
on where you are at and what you are hoping to accomplish post-sprint.

gh: working on version of affy server that impls das/2 v300 spec for
all xml responses. sample responses passed andrew's validation.
steve rolled it out to public server.

updated igb client to handle v300 xml.
worked more on server to impl v300 query syntax using full uri for
type segment, segment separate from overlaps and inside.
only impls a subset of the feature query. requires one and only one
segment, type, insides.

hoping todo for rest of sprint and after:
1. supporting name feat filters in igb client
2. remove restrictions from the server
3. making sure new version of server gets rolled out,
4. roll out jar for this version of igb. maybe put on genoviz sf site for
testing purposes.

bo: looked at xml docs that andrew checked in, updating ucla templates
on server, not rolled out to biopackages.net, waiting to make rpm,
hoping to do code cleanup in igb.
getting andrew's help running validator on local copy of server.

gh: igb would like to support v300, but one server is v200+ (ucla),
one at v300 (affy) complicates things. so getting your server good to
go would be my priority.

bo: code clean up involves assay and ontology interface.

gh: we're planning an igb release at end of march. as long as the code
is clean by then it's ok.

aday: code cleanup, things removed from protocol. exporting data
matrices from assay part of server.
validate sources document w/r/t v300 validator. work with brian to
make sure everything is update to v300. probably working on fiter
query, since we now treat things as names not full uri's.

ad: what extra config info do you need in server for that? can you get
it from the http headers?
gh: mine is being promiscuous, just name of type will work. might give
the wrong thing back, but for data we're serving back now, it can't be
wrong.

ad: how much trouble does the uri handling cause for you?

gh: has to be full uri of the type, doing otherwise is not an option
(in the spec).
ad: you could just use name internally, then put together full uri
when you go to the outside world.

ad: I updated comments in schema definitions, updated query lang
spec. string searches are substring searches not word-substring
searches. 
abc = whole field must be equal
*abc = suffix match
abc* = prefix match

previously said it was word match, but that's too complicated on
server.
worked with gregg to pin down what inside search means.

I'm thinking about the possibility of a validating proxy server,
configure das client to go through proxy before outside world, the
server would sniff everything going by.
Support for proxys can enable lots of sorts of things w/o needing
additional config for each client.

gh: how do you do proxy in java? i.e., redirect all network calls to a
proxy.
bo: there's a way to set proxy options via the system object in the
java vm. can show you some examples of this.

aday: performance.
gh: current webstart based ibg works with the existing public das/2
server, [comment pertaining to: the new version of igb and a new
version of the affy das/2 server].

ad: when will we get reference names from lincoln?
gh: should happen yesterday. poke him about this.
would be really nice to be able to overlay anotations!

The current version of igb can turn off v300 options, and then ti can
load stuff from the ucla server. The version of igb in cvs now can hit
both biopackages.net and affy server in the dmz. and there's
hardwiring to get things to overlay. temporary patch.

ee: two things:
1. style sheets. info from andrew yesterday. looking over that. will
   discuss questions w/ andrew.
2. making sure that when we do a new release of igb in a couple of
   weeks (when I'm not here) that it will go smoothly . go over w/
   gregg, steve. lots of testing.
made changes in parser code, should still work.

sc: I updated jars for das/1 not das/2 on netaffxdas.affymetrix.com.
ee: it's the das/1 I'm most concerned about.

sc: installed and updated gregg's new das/2 server on a publically
accessible machine (separate box from the production das/1 and das/2
servers on netaffxdas.affymetrix.com).
Also spent a time loading data for new affy arrays (mouse rat
exons). this required lots of memory, had to disable support for some
other arrays. [gregg's das servers load all annotations into memory at
start up, hance the big memory requirements for arrays with lots of
probe sets.]

[A] gregg optimize affy das server memory reqts for exon arrays.

gh: we' gotten a lot done this week. I think we have a stable spec.

gh: serving alignments, no cigars, but blat alignment to genome as
coords on mrna and coords on the genome. igb doesn't use it yet, but
it's there.
ad: xid in region elements.
gh: we haven't exercised the xids. so 'link' in das/1 is equivalent to
xid in das/2?
ad: yes. i believe
gh: if you have links in das/1. without links it can build links from
feature id using a template. This is used for building links from
within IGB back to netaffx, for example.

Topic: Writebacks
-----------------

gh: writebacks haven't been mentioned at all this week.
ad: we need people committed to writing a server to implement it.
gh: we decided that since ed griffith would be working on it at
Sanger, we wouldn't worry about it for ucla server.
bo: we started prototyping. locking mechanism. persisting part of a
mage document. the spec changed after that. andrew's delta model. a
little different from what we were prototyping.
actual persistence will be done in the assay portion of our server.
gh: grant focuses on write back for genome portion, and this was a big
chunk of the grant. ends in end of may or june.

ad: delta model was: here's a list of add, delete, modify in one
document. An issue was if you change an existing record, do you give
it a new identifier?
gh: you never modify something with an existing id, just make a new
one, new id, with a pointer back to old one. Ed Griffith said this a
month ago. I like this idea. but told we cannot make this requirement
on the database. but very few dbs will be writeback, so it's not
affecting all servers

ad: making new uris, client has to know the new uri for the old
one. needs to return a mapping document.
if network crashes partway through, client won't know mapping is and
will be lost.
gh: server doesn't know if client got it. it could act(?) back.

gh: if a response from http server dies, server has no way to know.
ad: There could be a proxy in the middle, or isp's proxy server. The
server sent it successfully to the proxy, but never made it to the
client. 

gh: how is this dealt with for commits into relational dbs? same thing
applies 
ad: don't know
ee: could ask for everything in this region.
ad: have a new element that says 'i used to be this'.
bo: you do an insert in a db, to get last pk that was issued. client
talks back to server, give me last feature uri that was provisioned on
my connection. so the client is in control.

sc: it's up to client to get confirmation from server. If it failed to
get the response after sending in the modification request, it could
request that the server send it again.

ad: (drawing on whiteboard) two stage strategy, get a transaction state.

     post "get transaction url"
    <---------------
    post (put?) to transaction URL
    ------------->
    can do multiple (if identical)
       ---------->
       ---------->
    Get was successful and here's transformation info
    <---------------

ad: server can hold transformation info for some timespan in case
client needs to re-fetch.

gh: I'm more insterested in getting a server up than a client
regarding writeback. complex parts of the client are already
implemented (apollo).

gh: locks are region based not feature based.
ad: can't lock...

gh: we can talk about how to trigger ucla locking mechanism.
bo: did flock transactional locking the suggested in perl
cookbook. mage document has content. server locks an id using flock,
(for assay das).
gh: to lock a region on the genome, lock on all ids for features in
this region.
bo: make a file containing all the ids that are locked. flock this
file. 

ad: file locking is frought with problems. why not keep it in the
database and let the db lock it for you. don't let perl + file system
do it for you. there could be fs problems. nfs isn't good at that. a
database is much more reliable.

bo: I went with perl flock mechanism since you could have other
non-database sources (though so far it's all db).

[A] steve, allen send brian code tips regarding locking.

gh: putting aside pushing large data chunks into the server, for
curation it's ok if protocol is a little error prone, since the
curator-caused errors will be much more likely/common.

ad: UK folks haven't done any writeback work as far as I know.
gh: they haven't billed us in 2 years. Tony cox is contact, ed
griffith is main developer.
ad: andreas and thomas are not funded by this grant or the next one.
gh: they are already funded by other means.

ad: if someone want's to change an annotation should they need to get
a lock first or can it work like cvs? do it if it can, get lock,
release lock in one transaction.
ee: that's my preference.

ad: if every feature has it's own id, you know if it's...

ee: some servers might not have any writeback facility at
all. conflicts will be rare.

[A] ask ed/tony on whether they plan to have any writeback facility

gh: ed g wanted to work on client to do writeback, don't know who
would work on a server there.
ad: someone else, can't remember - roy?
gh: unless we hear back from sanger, the highest priority for ucla
folks after updating server for v300, is working server-side
writeback. 

gh: spec freeze is for the read portion. the writeback portion will
have to change as needed.
ad: and arithmetic? ;-)


From lstein at cshl.edu  Mon Mar 20 12:27:59 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 20 Mar 2006 12:27:59 -0500
Subject: [DAS2] Notes from DAS/2 code sprint #2, day five, 17 Mar 2006
In-Reply-To: <C043758C.1D23A%Steve_Chervitz@affymetrix.com>
References: <C043758C.1D23A%Steve_Chervitz@affymetrix.com>
Message-ID: <200603201227.59816.lstein@cshl.edu>

Hi Folks,

I will join the DAS2 call a little late today (no more than 10 min). I'm 
assuming that we're on?

Lincoln

On Sunday 19 March 2006 23:54, Steve Chervitz wrote:
> Notes from DAS/2 code sprint #2, day five, 17 Mar 2006
>
> $Id: das2-teleconf-2006-03-17.txt,v 1.2 2006/03/20 05:05:22 sac Exp $
>
> Note taker: Steve Chervitz
>
> Attendees:
>   Affy: Steve Chervitz, Ed E., Gregg Helt
>   Dalke Scientific: Andrew Dalke (at Affy)
>   UCLA: Allen Day, Brian O'Connor (at Affy)
>
> Action items are flagged with '[A]'.
>
> These notes are checked into the biodas.org CVS repository at
> das/das2/notes/2006. Instructions on how to access this
> repository are at http://biodas.org
>
> DISCLAIMER:
> The note taker aims for completeness and accuracy, but these goals are
> not always achievable, given the desire to get the notes out with a
> rapid turnaround. So don't consider these notes as complete minutes
> from the meeting, but rather abbreviated, summarized versions of what
> was discussed. There may be errors of commission and omission.
> Participants are welcome to post comments and/or corrections to these
> as they see fit.
>
> Agenda:
>
> * Status reports
> * Writeback progress
>
>
> Status reports:
> ---------------
>
> gh: This is the last mtg of code sprint. For the status reports, focus
> on where you are at and what you are hoping to accomplish post-sprint.
>
> gh: working on version of affy server that impls das/2 v300 spec for
> all xml responses. sample responses passed andrew's validation.
> steve rolled it out to public server.
>
> updated igb client to handle v300 xml.
> worked more on server to impl v300 query syntax using full uri for
> type segment, segment separate from overlaps and inside.
> only impls a subset of the feature query. requires one and only one
> segment, type, insides.
>
> hoping todo for rest of sprint and after:
> 1. supporting name feat filters in igb client
> 2. remove restrictions from the server
> 3. making sure new version of server gets rolled out,
> 4. roll out jar for this version of igb. maybe put on genoviz sf site for
> testing purposes.
>
> bo: looked at xml docs that andrew checked in, updating ucla templates
> on server, not rolled out to biopackages.net, waiting to make rpm,
> hoping to do code cleanup in igb.
> getting andrew's help running validator on local copy of server.
>
> gh: igb would like to support v300, but one server is v200+ (ucla),
> one at v300 (affy) complicates things. so getting your server good to
> go would be my priority.
>
> bo: code clean up involves assay and ontology interface.
>
> gh: we're planning an igb release at end of march. as long as the code
> is clean by then it's ok.
>
> aday: code cleanup, things removed from protocol. exporting data
> matrices from assay part of server.
> validate sources document w/r/t v300 validator. work with brian to
> make sure everything is update to v300. probably working on fiter
> query, since we now treat things as names not full uri's.
>
> ad: what extra config info do you need in server for that? can you get
> it from the http headers?
> gh: mine is being promiscuous, just name of type will work. might give
> the wrong thing back, but for data we're serving back now, it can't be
> wrong.
>
> ad: how much trouble does the uri handling cause for you?
>
> gh: has to be full uri of the type, doing otherwise is not an option
> (in the spec).
> ad: you could just use name internally, then put together full uri
> when you go to the outside world.
>
> ad: I updated comments in schema definitions, updated query lang
> spec. string searches are substring searches not word-substring
> searches.
> abc = whole field must be equal
> *abc = suffix match
> abc* = prefix match
>
> previously said it was word match, but that's too complicated on
> server.
> worked with gregg to pin down what inside search means.
>
> I'm thinking about the possibility of a validating proxy server,
> configure das client to go through proxy before outside world, the
> server would sniff everything going by.
> Support for proxys can enable lots of sorts of things w/o needing
> additional config for each client.
>
> gh: how do you do proxy in java? i.e., redirect all network calls to a
> proxy.
> bo: there's a way to set proxy options via the system object in the
> java vm. can show you some examples of this.
>
> aday: performance.
> gh: current webstart based ibg works with the existing public das/2
> server, [comment pertaining to: the new version of igb and a new
> version of the affy das/2 server].
>
> ad: when will we get reference names from lincoln?
> gh: should happen yesterday. poke him about this.
> would be really nice to be able to overlay anotations!
>
> The current version of igb can turn off v300 options, and then ti can
> load stuff from the ucla server. The version of igb in cvs now can hit
> both biopackages.net and affy server in the dmz. and there's
> hardwiring to get things to overlay. temporary patch.
>
> ee: two things:
> 1. style sheets. info from andrew yesterday. looking over that. will
>    discuss questions w/ andrew.
> 2. making sure that when we do a new release of igb in a couple of
>    weeks (when I'm not here) that it will go smoothly . go over w/
>    gregg, steve. lots of testing.
> made changes in parser code, should still work.
>
> sc: I updated jars for das/1 not das/2 on netaffxdas.affymetrix.com.
> ee: it's the das/1 I'm most concerned about.
>
> sc: installed and updated gregg's new das/2 server on a publically
> accessible machine (separate box from the production das/1 and das/2
> servers on netaffxdas.affymetrix.com).
> Also spent a time loading data for new affy arrays (mouse rat
> exons). this required lots of memory, had to disable support for some
> other arrays. [gregg's das servers load all annotations into memory at
> start up, hance the big memory requirements for arrays with lots of
> probe sets.]
>
> [A] gregg optimize affy das server memory reqts for exon arrays.
>
> gh: we' gotten a lot done this week. I think we have a stable spec.
>
> gh: serving alignments, no cigars, but blat alignment to genome as
> coords on mrna and coords on the genome. igb doesn't use it yet, but
> it's there.
> ad: xid in region elements.
> gh: we haven't exercised the xids. so 'link' in das/1 is equivalent to
> xid in das/2?
> ad: yes. i believe
> gh: if you have links in das/1. without links it can build links from
> feature id using a template. This is used for building links from
> within IGB back to netaffx, for example.
>
> Topic: Writebacks
> -----------------
>
> gh: writebacks haven't been mentioned at all this week.
> ad: we need people committed to writing a server to implement it.
> gh: we decided that since ed griffith would be working on it at
> Sanger, we wouldn't worry about it for ucla server.
> bo: we started prototyping. locking mechanism. persisting part of a
> mage document. the spec changed after that. andrew's delta model. a
> little different from what we were prototyping.
> actual persistence will be done in the assay portion of our server.
> gh: grant focuses on write back for genome portion, and this was a big
> chunk of the grant. ends in end of may or june.
>
> ad: delta model was: here's a list of add, delete, modify in one
> document. An issue was if you change an existing record, do you give
> it a new identifier?
> gh: you never modify something with an existing id, just make a new
> one, new id, with a pointer back to old one. Ed Griffith said this a
> month ago. I like this idea. but told we cannot make this requirement
> on the database. but very few dbs will be writeback, so it's not
> affecting all servers
>
> ad: making new uris, client has to know the new uri for the old
> one. needs to return a mapping document.
> if network crashes partway through, client won't know mapping is and
> will be lost.
> gh: server doesn't know if client got it. it could act(?) back.
>
> gh: if a response from http server dies, server has no way to know.
> ad: There could be a proxy in the middle, or isp's proxy server. The
> server sent it successfully to the proxy, but never made it to the
> client.
>
> gh: how is this dealt with for commits into relational dbs? same thing
> applies
> ad: don't know
> ee: could ask for everything in this region.
> ad: have a new element that says 'i used to be this'.
> bo: you do an insert in a db, to get last pk that was issued. client
> talks back to server, give me last feature uri that was provisioned on
> my connection. so the client is in control.
>
> sc: it's up to client to get confirmation from server. If it failed to
> get the response after sending in the modification request, it could
> request that the server send it again.
>
> ad: (drawing on whiteboard) two stage strategy, get a transaction state.
>
>      post "get transaction url"
>     <---------------
>     post (put?) to transaction URL
>     ------------->
>     can do multiple (if identical)
>        ---------->
>        ---------->
>     Get was successful and here's transformation info
>     <---------------
>
> ad: server can hold transformation info for some timespan in case
> client needs to re-fetch.
>
> gh: I'm more insterested in getting a server up than a client
> regarding writeback. complex parts of the client are already
> implemented (apollo).
>
> gh: locks are region based not feature based.
> ad: can't lock...
>
> gh: we can talk about how to trigger ucla locking mechanism.
> bo: did flock transactional locking the suggested in perl
> cookbook. mage document has content. server locks an id using flock,
> (for assay das).
> gh: to lock a region on the genome, lock on all ids for features in
> this region.
> bo: make a file containing all the ids that are locked. flock this
> file.
>
> ad: file locking is frought with problems. why not keep it in the
> database and let the db lock it for you. don't let perl + file system
> do it for you. there could be fs problems. nfs isn't good at that. a
> database is much more reliable.
>
> bo: I went with perl flock mechanism since you could have other
> non-database sources (though so far it's all db).
>
> [A] steve, allen send brian code tips regarding locking.
>
> gh: putting aside pushing large data chunks into the server, for
> curation it's ok if protocol is a little error prone, since the
> curator-caused errors will be much more likely/common.
>
> ad: UK folks haven't done any writeback work as far as I know.
> gh: they haven't billed us in 2 years. Tony cox is contact, ed
> griffith is main developer.
> ad: andreas and thomas are not funded by this grant or the next one.
> gh: they are already funded by other means.
>
> ad: if someone want's to change an annotation should they need to get
> a lock first or can it work like cvs? do it if it can, get lock,
> release lock in one transaction.
> ee: that's my preference.
>
> ad: if every feature has it's own id, you know if it's...
>
> ee: some servers might not have any writeback facility at
> all. conflicts will be rare.
>
> [A] ask ed/tony on whether they plan to have any writeback facility
>
> gh: ed g wanted to work on client to do writeback, don't know who
> would work on a server there.
> ad: someone else, can't remember - roy?
> gh: unless we hear back from sanger, the highest priority for ucla
> folks after updating server for v300, is working server-side
> writeback.
>
> gh: spec freeze is for the read portion. the writeback portion will
> have to change as needed.
> ad: and arithmetic? ;-)
>
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From lstein at cshl.edu  Mon Mar 20 12:32:40 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 20 Mar 2006 12:32:40 -0500
Subject: [DAS2] query language description
In-Reply-To: <da80d7e4da69801f0a7b2b210fc66595@fruitfly.org>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<c4433b247b29525254354103b60ce414@dalkescientific.com>
	<da80d7e4da69801f0a7b2b210fc66595@fruitfly.org>
Message-ID: <200603201232.41522.lstein@cshl.edu>

The current filter query language, which provides one level of ANDs and a 
nested level of ORs, satisfies our use cases. It is not clear to me what 
additional benefit we'll get from a composable query language. Note that none 
of the popular and functional genome information sources -- NCBI, UCSC, 
Ensembl or BioMart -- offer a composable query language, and there does not 
seem to be rioting on the streets!

Lincoln


On Friday 17 March 2006 19:20, chris mungall wrote:
> On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote:
> >> right now they are forced bypass the constraint language and go direct
> >> to SQL.
> >
> > In addition, we provide defined ways for a server to indicate
> > that there are additional ways to query the server.
>
> I was positing this as a bad feature, not a good one. or at least a
> symptom of an incorrectly designed system (at least in the case of the
> GO DB API - it may not carry forward to DAS - though if you're going to
> allow querying by terms...)
>
> >> None of these really lit into the DAS paradigm. I'm guessing you want
> >> something simple that can be used as easily as an API with get-by-X
> >> methods but will seamlessly blend into something more powerful. I
> >> think what you have is on the right lines. I'm just arguing to make
> >> this language composable from the outset, so that it can be extended
> >> to whatever expressivity is required in the future, without bolting on
> >> a new query system that's incompatible with the existing one.
> >
> > We have two ways to compose the system.  If the simple query language
> > is extended, for example, to support word searches of the text field
> > instead of substring searches, then a server can say
> >
> > <CAPABILITY type="features"
> > query_uri="http://somewhere.over.rainbow/server.cgi">
> >    <SUPPORTS name="word-search"/>
> > </CAPABILITY>
> >
> > This is backwards compatible, so the normal DAS queries work.  But
> > a client can recognize the new feature and support whatever new filters
> > that 'word-search' indicates, eg
> >
> >    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*
> >
> > (finds features with notes containing words starting with 'Andre' )
> >
> > These are composable.  For example, suppose Sanger allows modification
> > date searches of curation events.  Then it might say
> >
> > <CAPABILITY type="features"
> > query_uri="http://somewhere.over.rainbow/server.cgi">
> >    <SUPPORTS name="word-search"/>
> >    <SUPPORTS name="sanger-curation"/>
> > </CAPABILITY>
>
> so this is limited to single-argument search functions?
>
> > and I can search for notes containing words starting with "Andre"
> > which were modified by "dalke" between 2002 and 2005 by doing
> >
> >    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*&
> >         modified-by=dalke&modified-before=2005&modified-after=2002
>
> but the compositionality is always associative since the CGI parameter
> constraint forbids nesting
>
> > An advantage to the simple boolean logic of the current system
> > is that the GUI interface is easy, and in line with existing
> > simple search systems.
>
> there's nothing preventing you from implementing a simple GUI on top of
> an expressive system - there is nothing forcing you to use the
> expressivity
>
> > If someone wants to implement a new search system which is
> > not backwards compatible then the server can indicate that
> > alternative with a new CAPABILITY.  Suppose Thomas at Sanger
> > comes up with a new search mechanism based on an object query
> > language he invented,
> >
> > <CAPABILITY type="down-oql"
> >      query_uri="http://sanger.ac.uk/oql-search" />
> >
> > The Sanger and EBI clients might understand that and support
> > a more complex GUI, eg, with a text box interface.  Everyone
> > else must ignore unknown capability types.
>
> but this doesn't integrate with the existing query system
>
> > Then that would be POSTED (or whatever the protocol defines)
> > to the given URL, which returns back whatever results are
> > desired.
> >
> > Or the server can point to a public MySQL port, like
> >
> > <CAPABILITY type="mysql-connection"
> >      query_uri="mysql://username:password at hostname:port/databasename"
> > />
> >
> > That's what you are doing to bypass the syntax, except that
> > here it isn't a bypass; you can define the new interface in
> > the DAS sources document.
> >
> >> The generic language could just be some kind of simple
> >> extensible function syntax for search terms, boolean operators,
> >> and some kind of (optional) nesting syntax.
> >
> > Which syntax?  Is it supposed to be easy for people to write?
> > Text oriented?  Or tree structured, like XML, or SQL-like?
>
> I'd favour some concrete asbtract syntax that looks much like the
> existing DAS QL
>
> > And which clients and servers will implement that search
> > language?
>
> all servers. clients optional
>
> > If there was a generic language it would allow
> >    OR("on segment Chr1 between 1000 and 2000",
> >       "on segment ChrX between 99 and 777")
> > which is something we are expressly not allowing in DAS2
> > queries.  It doesn't make sense for the target applications
> > and by excluding it it simplifies the server development,
> > which means less chance for bugs.
>
> this example is pointless but it's easy to imagine plenty of ontology
> term queries or other queries in which this would be useful
>
> I guess I depart from the normal DAS philosophy - I don't see this
> being a high barrier for entry for servers, if they're not up to this
> it'll probably be a buggy hacky server anyway
>
> > Also, I personally haven't figured out a decent way to
> > do a GUI composition of a complex boolean query which is
> > as easy as learning the query language in the first place.
>
> doesn't mean it doesn't exist.
>
> i'm not sure what's hard about having say, a clipboard of favourite
> queries, then allowing some kind of drag-and-drop composition
>
> > A more generic language implementation is a lot of overhead
> > if most (80%? 90%) need basic searches, and many of the
> > rest can fake it by breaking a request into parts and
> > doing the boolean logic on the client side.
>
> this is always an option - if the user doesn't mind the additional
> possibly very high overhead. it's just a little bit of a depressing
> approach, as if Codd's seminal paper from 1970 or whenever it was never
> happened.
>
> > Feedback I've heard so far is that DAS1 queries were
> > acceptable, with only a few new search fields needed.
> >
> >> hmm, not sure how useful this would be - surely you'd want something
> >> more dasmodel-aware?
> >
> > The example I gave was a bad one.  What I meant was to show
> > how there's an extension point so someone can develop a new
> > search interface and clients can know that the new functionality
> > exists, without having to change the DAS spec.
>
> ok
>
> that's probably all I've got to say on the matter, sorry for being
> irksome. I guess I'm fundamentally missing something, that is, why wrap
> simple and expressive declarative query languages with limited ad-hoc
> constraint systems with consciously limited expressivity and limited
> means of extensibility..
>
> cheers
> chris
>
> > 					Andrew
> > 					dalke at dalkescientific.com
> >
> > _______________________________________________
> > DAS2 mailing list
> > DAS2 at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/das2
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From Gregg_Helt at affymetrix.com  Mon Mar 20 12:40:19 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 20 Mar 2006 09:40:19 -0800
Subject: [DAS2] call today?
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA47@msex02.affymetrix.com>

Apologies, I forgot to post that today's DAS/2 teleconference was
cancelled.  The feeling on Friday was that after the code sprint last
week we needed a break.  The teleconference will resume next week on the
regular schedule (Mondays at 9:30 AM Pacific time).

	Thanks,
	Gregg

> -----Original Message-----
> From: Andreas Prlic [mailto:ap3 at sanger.ac.uk]
> Sent: Monday, March 20, 2006 9:02 AM
> To: Andrew Dalke; Helt,Gregg
> Cc: Thomas Down
> Subject: call today?
> 
> Hi Dasians,
> 
> do we have a conference call today?
> 
> Cheers,
> Andreas
> 
>
-----------------------------------------------------------------------
> 
> Andreas Prlic      Wellcome Trust Sanger Institute
>                                Hinxton, Cambridge CB10 1SA, UK
> 			 +44 (0) 1223 49 6891


From cjm at fruitfly.org  Mon Mar 20 18:45:46 2006
From: cjm at fruitfly.org (chris mungall)
Date: Mon, 20 Mar 2006 15:45:46 -0800
Subject: [DAS2] query language description
In-Reply-To: <200603201232.41522.lstein@cshl.edu>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<c4433b247b29525254354103b60ce414@dalkescientific.com>
	<da80d7e4da69801f0a7b2b210fc66595@fruitfly.org>
	<200603201232.41522.lstein@cshl.edu>
Message-ID: <7900d1398d5045a268a5f6fe51af529d@fruitfly.org>


I guess things need to be left open for a DAS/3...

On Mar 20, 2006, at 9:32 AM, Lincoln Stein wrote:

> The current filter query language, which provides one level of ANDs 
> and a
> nested level of ORs, satisfies our use cases. It is not clear to me 
> what
> additional benefit we'll get from a composable query language. Note 
> that none
> of the popular and functional genome information sources -- NCBI, UCSC,
> Ensembl or BioMart -- offer a composable query language, and there 
> does not
> seem to be rioting on the streets!
>
> Lincoln
>
>
> On Friday 17 March 2006 19:20, chris mungall wrote:
>> On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote:
>>>> right now they are forced bypass the constraint language and go 
>>>> direct
>>>> to SQL.
>>>
>>> In addition, we provide defined ways for a server to indicate
>>> that there are additional ways to query the server.
>>
>> I was positing this as a bad feature, not a good one. or at least a
>> symptom of an incorrectly designed system (at least in the case of the
>> GO DB API - it may not carry forward to DAS - though if you're going 
>> to
>> allow querying by terms...)
>>
>>>> None of these really lit into the DAS paradigm. I'm guessing you 
>>>> want
>>>> something simple that can be used as easily as an API with get-by-X
>>>> methods but will seamlessly blend into something more powerful. I
>>>> think what you have is on the right lines. I'm just arguing to make
>>>> this language composable from the outset, so that it can be extended
>>>> to whatever expressivity is required in the future, without bolting 
>>>> on
>>>> a new query system that's incompatible with the existing one.
>>>
>>> We have two ways to compose the system.  If the simple query language
>>> is extended, for example, to support word searches of the text field
>>> instead of substring searches, then a server can say
>>>
>>> <CAPABILITY type="features"
>>> query_uri="http://somewhere.over.rainbow/server.cgi">
>>>    <SUPPORTS name="word-search"/>
>>> </CAPABILITY>
>>>
>>> This is backwards compatible, so the normal DAS queries work.  But
>>> a client can recognize the new feature and support whatever new 
>>> filters
>>> that 'word-search' indicates, eg
>>>
>>>    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*
>>>
>>> (finds features with notes containing words starting with 'Andre' )
>>>
>>> These are composable.  For example, suppose Sanger allows 
>>> modification
>>> date searches of curation events.  Then it might say
>>>
>>> <CAPABILITY type="features"
>>> query_uri="http://somewhere.over.rainbow/server.cgi">
>>>    <SUPPORTS name="word-search"/>
>>>    <SUPPORTS name="sanger-curation"/>
>>> </CAPABILITY>
>>
>> so this is limited to single-argument search functions?
>>
>>> and I can search for notes containing words starting with "Andre"
>>> which were modified by "dalke" between 2002 and 2005 by doing
>>>
>>>    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*&
>>>         modified-by=dalke&modified-before=2005&modified-after=2002
>>
>> but the compositionality is always associative since the CGI parameter
>> constraint forbids nesting
>>
>>> An advantage to the simple boolean logic of the current system
>>> is that the GUI interface is easy, and in line with existing
>>> simple search systems.
>>
>> there's nothing preventing you from implementing a simple GUI on top 
>> of
>> an expressive system - there is nothing forcing you to use the
>> expressivity
>>
>>> If someone wants to implement a new search system which is
>>> not backwards compatible then the server can indicate that
>>> alternative with a new CAPABILITY.  Suppose Thomas at Sanger
>>> comes up with a new search mechanism based on an object query
>>> language he invented,
>>>
>>> <CAPABILITY type="down-oql"
>>>      query_uri="http://sanger.ac.uk/oql-search" />
>>>
>>> The Sanger and EBI clients might understand that and support
>>> a more complex GUI, eg, with a text box interface.  Everyone
>>> else must ignore unknown capability types.
>>
>> but this doesn't integrate with the existing query system
>>
>>> Then that would be POSTED (or whatever the protocol defines)
>>> to the given URL, which returns back whatever results are
>>> desired.
>>>
>>> Or the server can point to a public MySQL port, like
>>>
>>> <CAPABILITY type="mysql-connection"
>>>      query_uri="mysql://username:password at hostname:port/databasename"
>>> />
>>>
>>> That's what you are doing to bypass the syntax, except that
>>> here it isn't a bypass; you can define the new interface in
>>> the DAS sources document.
>>>
>>>> The generic language could just be some kind of simple
>>>> extensible function syntax for search terms, boolean operators,
>>>> and some kind of (optional) nesting syntax.
>>>
>>> Which syntax?  Is it supposed to be easy for people to write?
>>> Text oriented?  Or tree structured, like XML, or SQL-like?
>>
>> I'd favour some concrete asbtract syntax that looks much like the
>> existing DAS QL
>>
>>> And which clients and servers will implement that search
>>> language?
>>
>> all servers. clients optional
>>
>>> If there was a generic language it would allow
>>>    OR("on segment Chr1 between 1000 and 2000",
>>>       "on segment ChrX between 99 and 777")
>>> which is something we are expressly not allowing in DAS2
>>> queries.  It doesn't make sense for the target applications
>>> and by excluding it it simplifies the server development,
>>> which means less chance for bugs.
>>
>> this example is pointless but it's easy to imagine plenty of ontology
>> term queries or other queries in which this would be useful
>>
>> I guess I depart from the normal DAS philosophy - I don't see this
>> being a high barrier for entry for servers, if they're not up to this
>> it'll probably be a buggy hacky server anyway
>>
>>> Also, I personally haven't figured out a decent way to
>>> do a GUI composition of a complex boolean query which is
>>> as easy as learning the query language in the first place.
>>
>> doesn't mean it doesn't exist.
>>
>> i'm not sure what's hard about having say, a clipboard of favourite
>> queries, then allowing some kind of drag-and-drop composition
>>
>>> A more generic language implementation is a lot of overhead
>>> if most (80%? 90%) need basic searches, and many of the
>>> rest can fake it by breaking a request into parts and
>>> doing the boolean logic on the client side.
>>
>> this is always an option - if the user doesn't mind the additional
>> possibly very high overhead. it's just a little bit of a depressing
>> approach, as if Codd's seminal paper from 1970 or whenever it was 
>> never
>> happened.
>>
>>> Feedback I've heard so far is that DAS1 queries were
>>> acceptable, with only a few new search fields needed.
>>>
>>>> hmm, not sure how useful this would be - surely you'd want something
>>>> more dasmodel-aware?
>>>
>>> The example I gave was a bad one.  What I meant was to show
>>> how there's an extension point so someone can develop a new
>>> search interface and clients can know that the new functionality
>>> exists, without having to change the DAS spec.
>>
>> ok
>>
>> that's probably all I've got to say on the matter, sorry for being
>> irksome. I guess I'm fundamentally missing something, that is, why 
>> wrap
>> simple and expressive declarative query languages with limited ad-hoc
>> constraint systems with consciously limited expressivity and limited
>> means of extensibility..
>>
>> cheers
>> chris
>>
>>> 					Andrew
>>> 					dalke at dalkescientific.com
>>>
>>> _______________________________________________
>>> DAS2 mailing list
>>> DAS2 at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/das2
>>
>> _______________________________________________
>> DAS2 mailing list
>> DAS2 at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/das2
>
> -- 
> Lincoln D. Stein
> Cold Spring Harbor Laboratory
> 1 Bungtown Road
> Cold Spring Harbor, NY 11724
> FOR URGENT MESSAGES & SCHEDULING,
> PLEASE CONTACT MY ASSISTANT,
> SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From dalke at dalkescientific.com  Tue Mar 21 18:21:11 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 21 Mar 2006 15:21:11 -0800
Subject: [DAS2] complex features
Message-ID: <e0996f0489afa3170e718b5c538bcbf4@dalkescientific.com>

I've been working on the data model some, trying to get a feel
for complex features.  I've also been evaluating how GFF3 handles
them.

Both use a parent/child link, though GFF3 only has the reference
to the parent while DAS has both.  That means DAS clients can
determine when all of the complex feature have been downloaded.
GFF3 potentially requires waiting until the end of the library,
though there is a way to hint that all the results have been
returned.

Both allow complex graphs.  That is, both allow cycles.  I
assume we are restricting complex features to DAGs, but even
then the following is possible

   [root1]     [root2]     [root3]
     | \          |         /
     |  \         |        /
     |   ------------------
     |  |    node 4        |
     |   ------------------
     |  /
     | /
     |/
  [node 5]

Node 4 has three parents (root1, root2 and root3) and
node 5 has two parents (root1 and node4)

This may or may not make biological sense.  I don't know.  I
only point out that it's there.

I feel that complex annotations must only have a single root
element, even if it's a synthetic one with no location.

Next, consider writeback, with the following two complex features

  [root1]                  [root2]
   |    \                   |
   |      \                 |
   |        \               |
[node1.1]  [node1.2]     [node2.1]


Suppose someone adds a new "connector" node

                       >-->---.
                       |      V
  [root1]              |   [root2]
   |    \              |    |
   |      \            |    |
   |        \          ^    |
[node1.1]  [node1.2]  |  [node2.1]
     |                 |
     V                 |
   [connector]-->--->--^

Should that sort of thing be allowed?  What's the model
for the behavior?

It seems to me there's a missing concept in DAS relating to
complex features.  My model is that the "complex feature" is
its own concept, which  I've been calling an "annotation".
All simple features are annotations.  The connected nodes of
a complex feature are also annotations.

As such, two annotations cannot be combined like this.
Writeback only occurs at the annotation level, in that
new feature elements cannot be used to connect two existing
annotations.

We might also consider having a new interface for annotations
(complex features), so they can be referred to by URI.  I
don't think that's needed right now.


					Andrew
					dalke at dalkescientific.com


From cjm at fruitfly.org  Tue Mar 21 19:43:49 2006
From: cjm at fruitfly.org (chris mungall)
Date: Tue, 21 Mar 2006 16:43:49 -0800
Subject: [DAS2] complex features
In-Reply-To: <e0996f0489afa3170e718b5c538bcbf4@dalkescientific.com>
References: <e0996f0489afa3170e718b5c538bcbf4@dalkescientific.com>
Message-ID: <3879834dc8786f628c68e47a076c1e90@fruitfly.org>


The GFF3 spec says that Parent can only be used to indicate part_of 
relations. If we go by the definition of part_of in the OBO relations 
ontology, or any other definition of part_of (there are many), then 
cycles are explicitly verboten, although the GFF3 docs do not state 
this.

There's no reason in general why part_of graphs should have a single 
root, although it's certainly desirable from a software perspective. 
Dicistronic genes thow a bit of a spanner in the works. There's nothing 
to stop you adding a fake root, or refering to the maximally connected 
graph as an entity in its own right however.

I don't know enough about DAS/2 to be helpful with the writeback 
example. It looks like your example below is a gene merge.

On Mar 21, 2006, at 3:21 PM, Andrew Dalke wrote:

> I've been working on the data model some, trying to get a feel
> for complex features.  I've also been evaluating how GFF3 handles
> them.
>
> Both use a parent/child link, though GFF3 only has the reference
> to the parent while DAS has both.  That means DAS clients can
> determine when all of the complex feature have been downloaded.
> GFF3 potentially requires waiting until the end of the library,
> though there is a way to hint that all the results have been
> returned.
>
> Both allow complex graphs.  That is, both allow cycles.  I
> assume we are restricting complex features to DAGs, but even
> then the following is possible
>
>    [root1]     [root2]     [root3]
>      | \          |         /
>      |  \         |        /
>      |   ------------------
>      |  |    node 4        |
>      |   ------------------
>      |  /
>      | /
>      |/
>   [node 5]
>
> Node 4 has three parents (root1, root2 and root3) and
> node 5 has two parents (root1 and node4)
>
> This may or may not make biological sense.  I don't know.  I
> only point out that it's there.
>
> I feel that complex annotations must only have a single root
> element, even if it's a synthetic one with no location.
>
> Next, consider writeback, with the following two complex features
>
>   [root1]                  [root2]
>    |    \                   |
>    |      \                 |
>    |        \               |
> [node1.1]  [node1.2]     [node2.1]
>
>
> Suppose someone adds a new "connector" node
>
>> -->---.
>                        |      V
>   [root1]              |   [root2]
>    |    \              |    |
>    |      \            |    |
>    |        \          ^    |
> [node1.1]  [node1.2]  |  [node2.1]
>      |                 |
>      V                 |
>    [connector]-->--->--^
>
> Should that sort of thing be allowed?  What's the model
> for the behavior?
>
> It seems to me there's a missing concept in DAS relating to
> complex features.  My model is that the "complex feature" is
> its own concept, which  I've been calling an "annotation".
> All simple features are annotations.  The connected nodes of
> a complex feature are also annotations.
>
> As such, two annotations cannot be combined like this.
> Writeback only occurs at the annotation level, in that
> new feature elements cannot be used to connect two existing
> annotations.
>
> We might also consider having a new interface for annotations
> (complex features), so they can be referred to by URI.  I
> don't think that's needed right now.
>
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From boconnor at ucla.edu  Tue Mar 21 19:47:51 2006
From: boconnor at ucla.edu (Brian O'Connor)
Date: Tue, 21 Mar 2006 16:47:51 -0800
Subject: [DAS2] das.biopackages.net
Message-ID: <44209EB7.9070008@ucla.edu>

The DAS/2 server located at das.biopackages.net may be unavailable on 
and off for the next hour or so.  Just wanted to let everyone know in 
case someone is using it.

--Brian


From dalke at dalkescientific.com  Thu Mar 23 16:44:00 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 23 Mar 2006 13:44:00 -0800
Subject: [DAS2] complex features
In-Reply-To: <3879834dc8786f628c68e47a076c1e90@fruitfly.org>
References: <e0996f0489afa3170e718b5c538bcbf4@dalkescientific.com>
	<3879834dc8786f628c68e47a076c1e90@fruitfly.org>
Message-ID: <53840452abca7236130efd4e57f42aef@dalkescientific.com>

chris:
> The GFF3 spec says that Parent can only be used to indicate part_of 
> relations. If we go by the definition of part_of in the OBO relations 
> ontology, or any other definition of part_of (there are many), then 
> cycles are explicitly verboten, although the GFF3 docs do not state 
> this.

It looks like the most recent spec at
   http://song.sourceforge.net/gff3.shtml
does state this, although the earlier one did not:

   "A Parent relationship between two features that is not one of the
    Part-Of relationships listed in SO should trigger a parse exception
    Similarly, a set of Parent relationships that would cause a cycle
    should also trigger an exception."


> There's no reason in general why part_of graphs should have a single
> root, although it's certainly desirable from a software perspective.
> Dicistronic genes thow a bit of a spanner in the works. There's nothing
> to stop you adding a fake root, or refering to the maximally connected
> graph as an entity in its own right however.

I've been working with GFF3 data for a few days now, trying to
catch the different cases.  It isn't hard, but it had been a long
time since I worried about cycle detection.

The biggest problem has been keeping all the "could be a parent"
elements around until the entire data set is finished.  Except
for features with no ID and no Parents, parsers need to go to
the end of the file (or no-forward-references line) before
being able to do anything with the data.

In DAS it's easier because each feature lists all parents and
children, so it's possible to detect when a complex feature is
ready.  Even then it requires a bit of thinking to handle cases
with multiple roots.  It would be much easier if either all
complex features were in an element

   <COMPLEX-FEATURE>
    <FEATURE id="1" />
    <FEATURE id="2" />
   </COMPLEX-FEATURE>

or if there was a unique name to tie them together

    <FEATURE id="1" complex-feature-id="A"/>
    <FEATURE id="2" complex-feature-id="A"/>

Another solution is to make the problem simpler.  I see, for
example, that the biopython doesn't have any gff code and
the biojava one only works at the single feature level.  Only
bioperl implements a gff3 parser with support for complex features,
but it assumes all complex features are single rooted and that
the features are topologically sorted, so that parents come
before children.  It also looks like a diamond structure (single
root, two children, both with the same child) is supported on
input but the output assumes features are trees.

For example, I tried it just now on dmel-4-r4.3.gff from wormbase,
which I'm finding to be a bad example of what a GFF file should
look like.  It contains one duplicate ID, which bioperl catches
and dies on.  I fixed it. It then complains with a lot of

    MSG: Bio::SeqFeature::Annotated=HASH(0xba4a93c) is not contained
    within parent feature, and expansion is not valid, ignoring.

because the features are not topologically sorted, as in this
(trimmed) example.  The order is the same as in the file.

4  sim4:na_dbEST.same.dmel match_part  5175  5627 ...
                        Parent=88682278868229;Name=GH01459.5prime
4  sim4:na_dbEST.same.dmel match   5175    5627 ...
                       ID=88682278868229;Name=GH

The simpler the data model we use (eg, single rooted, output
must be topologically sorted with parents first) then the
more likely it is for client and server code to be correct and
the more likely there will be more DAS code.


					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Fri Mar 24 13:19:41 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 24 Mar 2006 18:19:41 +0000
Subject: [DAS2] 100th das1 source in registry
Message-ID: <23fe2aa8d3c4a9afc28782b3d3e58032@sanger.ac.uk>

Hi!

Today the 100th DAS1 source was registered in the DAS registration 
server at

http://das.sanger.ac.uk/registry/

It currently counts 101 DAS sources from 23 institutions in 9 countries.

The purpose of the DAS registration service is to keep track which DAS 
services are available
and to help with automated discovery of new DAS servers  on the client 
side.

Regards,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From Gregg_Helt at affymetrix.com  Fri Mar 24 13:37:21 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Fri, 24 Mar 2006 10:37:21 -0800
Subject: [DAS2] 100th das1 source in registry
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA50@msex02.affymetrix.com>

Congratulations!

On a related note, is there a way to automatically register DAS/2
servers yet?  If not, can I send you info to add the Affymetrix test
DAS/2 server to the registry?

	Thanks,
	Gregg

> -----Original Message-----
> From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-
> bio.org] On Behalf Of Andreas Prlic
> Sent: Friday, March 24, 2006 10:20 AM
> To: DAS/2
> Subject: [DAS2] 100th das1 source in registry
> 
> Hi!
> 
> Today the 100th DAS1 source was registered in the DAS registration
> server at
> 
> http://das.sanger.ac.uk/registry/
> 
> It currently counts 101 DAS sources from 23 institutions in 9
countries.
> 
> The purpose of the DAS registration service is to keep track which DAS
> services are available
> and to help with automated discovery of new DAS servers  on the client
> side.
> 
> Regards,
> Andreas
> 
> 
>
-----------------------------------------------------------------------
> 
> Andreas Prlic      Wellcome Trust Sanger Institute
>                                Hinxton, Cambridge CB10 1SA, UK
> 			 +44 (0) 1223 49 6891
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From ap3 at sanger.ac.uk  Sat Mar 25 06:13:06 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sat, 25 Mar 2006 11:13:06 +0000
Subject: [DAS2] 100th das1 source in registry
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA50@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA50@msex02.affymetrix.com>
Message-ID: <e34b6e6ad566f659e674f5d0f1dc176f@sanger.ac.uk>

> On a related note, is there a way to automatically register DAS/2
> servers yet?

A beta - version can be tried  at the toy-registry at
http://www.spice-3d.org/dasregistry/registerDas2Source.jsp

and the results will be visible at
http://www.spice-3d.org/dasregistry/das2/sources

- so far this provides a simple upload mechanism that is based on the 
sources decription.
what is still missing is a validation of the user provided data ("does 
this request give really a features response?")
plus other things like a html representation of the das2 servers.

  I think it would be great if Andrew's Dasypus server could provide an 
interface to the validation
mechanism that could be used by programs. If validation fails the 
response could contain
a link, to point the user to the nice error report web page.

will be abroad next week so can't join for the call...

Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From Gregg_Helt at affymetrix.com  Mon Mar 27 11:24:53 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 27 Mar 2006 08:24:53 -0800
Subject: [DAS2] Agenda for today's teleconference
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA53@msex02.affymetrix.com>

We're back on the standard DAS/2 teleconference schedule, every Monday
at 9:30 AM Pacific time.
 
Suggestions for today's agenda:
Code sprint summary
DAS/2 grant status
Writeback spec & implementation
???
 
Teleconference # US:   800-531-3250
            International:   303-928-2693
Conference ID: 2879055
Passcode: 1365
 

From Steve_Chervitz at affymetrix.com  Mon Mar 27 14:05:28 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Mon, 27 Mar 2006 11:05:28 -0800
Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 27 Mar 2006
Message-ID: <C04D7778.1D4FC%Steve_Chervitz@affymetrix.com>

Notes from the weekly DAS/2 teleconference, 27 Mar 2006

$Id: das2-teleconf-2006-03-27.txt,v 1.1 2006/03/27 19:03:30 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Gregg Helt
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke
  UC Berkeley: Nomi Harris
  UCLA: Allen Day 
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


Proposed agenda:
 * Code sprint summary
 * DAS/2 grant status
 * Writeback spec & implementation


[Notetaker: missed the first 40min - apologies]


Topic: Code sprint summary
--------------------------

gh: pleased with our progress during the last code sprint (13-17 Mar)

[Notetaker: detailed summaries of what folks did during this code sprint
are described here:
http://lists.open-bio.org/pipermail/das2/2006-March/000668.html ]


Topic: Writeback 
----------------

[Discussion in progress]

ls: in my model, every feature has a unique id, when you update it,
it's going to make the change to the object and not create a new one.
the object is associated with url in some way, when you update the
position of this exon, it's going to change some attributes of it.

gh: thomas proposed the alternative: every time you change a feature
you create a new one with a pointer back to the old one.

ad: can't speak for what db implementers will do for versioning of
features. only taking about merging from different complex
features. So only when you merge from complex ones.

ls: this is the history tracking business. writeback will explicitly
support merges and splits.
ad: how detailed does the spec need to be?
ls: driven by requirements.
ad: what are the reqts? I can't go further without more details. roy
said eevery modification gets new version, so you could do time
travel, if your db supported that.

ls: does igb or apollo explicitly support merges and splits among
transcripts?
gh: yes. curation in igb is experimental (now turned off). but it does
support these. as does apollo. so these are essential.
ls: writeback should have instructions for how feature will adopt
children of a subfeature. one feature adopts children of the other and
previous feature is now deprecated. there's a specific set
of operations for creating new features, renaming, spliting, and merging.
perhaps Nomi should write down what operations that apollo supports.

nh: yes, all those are supported as well as things like adjusting
endpoints of start of translation.
apollo can merge transcripts within a gene and between genes
(which offers to merge the associated genes). curators can do
'splurge' - a split, merge combo.
ls: that sounds like suzi's nomenclature.

gh: the db that apollo writes back to, do changes create new versions
of feature or change the feature itself?
nh: not sure. mark did the work with chado. I know they were doing
something to rewrite the entire feature if anything changed.

[A] nomi will ask Mark to join in discussion next week (3 April).

aday: what fraction of the operations are doing simple vs complex
things? eg., revising the gene model.
nh: revision happens a lot. mostly adjusting endpoints. splits and
merges are infrequent. adding annotation. But it doesn't matter how
infrequent the operations are, we either support them or we don't.

ad: when there are changes in the model, how does the client get
notified that the change occurred?
nh: that's tricky.
gh: this is outside the scope of the das/2 spec itself. as long as we
have locks to prevent simultaneous modification, that is
sufficient.

ad: there's no mechanism for polling server.
gh: yes, just requery server.
gh: but your client doesn't do it.
gh: I'm thinking of adding polling to get the last modified stuff.
For now, one can simply re-start your session to see what has changed.

aday: is the portion of writeback spec for modifying endpoints, simple
add/delete of annotations stable?
ad: the general idea is unchanged.

gh: priority here is before next meeting: brian and allen read over
writeback spec and identify any issues as implementers.
aday: looking for an 80% solution. not dealing with heritance wihich
is difficult. 
nh: splits and merges can be done with combos of simpler ops.

aday: performace operations will be affected. graph flattening and
partial indexes. splits and merges will affect this table, so will
have to trigger update of that table any time there's a
split/merge. this will have big impact on query performance: could be
1-2 sec for yeast, 30-60 min for human.

gh: what about if you do that update 1x/day? Then users would be
working off a snapshot that was current as of the end of previous
day. 
aday: caching on server responses will also be affected, unless we
turn caching off. maybe I can tell apache to remove a subset of cached
pages and leave others intact.

aday: for tiling requests - server could find affected blocks and
purge those, instead of purging the entire cache.
gh: you can't rely on any client to use your tiling strategy. but
could be helpful for those clients that use it.
aday: basically we'll have to turn caching off when we start doing
writeback.
gh: is there a way for server to detect what has changed?
gh: if database detects change it can flush cache for that sequence.
aday: maybe. possibly the easiest way to do this is via tiling.

gh: say you have two servers:
   1) everthing that can be edited
   2) everything that has been edited (slower)
aday: main server has all features and second server handles
writeback, just writes to gff file, then cron runs once a night to
merge the gff into the db.

gh: separate dbs: 1) curation  2) everything that has been edited.
aday: yes. persistent flat file adapter can be used for one of them.
gh: this is the sort of detail I'm looking for w/r/t development of
the writeback spec.

[A] allen and brian look over writeback spec to discuss on 3 April.


From nomi at fruitfly.org  Mon Mar 27 14:42:59 2006
From: nomi at fruitfly.org (Nomi Harris)
Date: Mon, 27 Mar 2006 11:42:59 -0800
Subject: [DAS2] Mark Gibson on Apollo writeback to Chado
Message-ID: <ryi64lzoerg.fsf@spongecake.lbl.gov>

mark gibson said that he plans to attend next monday's DAS/2
teleconference.  he also gave me permission to forward this message that
he wrote recently in response to a group that is adapting apollo and
wondered what he thought about direct-to-chado writeback vs. the use of
chadoxml as an intermediate storage format.  FlyBase Harvard prefers to
use the latter approach because (we gather) they worry about possibly
corrupting the database by having clients write directly to it.  if
anyone from harvard is reading this and feels that mark has
misrepresented their approach, please set us straight!

               Nomi

On 10 March 2006, Mark Gibson wrote:
 > Im rather biased as a I wrote the chado jdbc adapter [for Apollo], but let me put forth my 
 > view of chado jdbc vs chado xml.
 > 
 > The chado Jdbc adapter is transactional, the chado xml adapter is not. What this 
 > means is jdbc only makes changes in the database that reflect what has actually 
 > been changed in the apollo session, like updating a row in a table; with chado 
 > xml you just get the whole dump. So if a synonym has been added jdbc will add a 
 > row to the synonym table. For xml you will get the whole dump of the region you 
 > were editing (probably a gene) no matter how small the edit.
 > 
 > What I believe Harvard/Flybase then does (with chado xml) is wipe out the gene 
 > from the database and reinsert the gene from the chado xml. The problem with 
 > this approach is if you have data in the db thats not associated with apollo 
 > (for flybase this would be phenotype data) then that will get wiped out as well, 
 > and there has to be some way of reinstating non-apollo data. If you dont have 
 > non-apollo data and dont intend on having it in the future this isnt a huge 
 > issue I suppose. I think Harvard is integrating non-apollo data into their chado 
 > database.
 > 
 > I think what they are going to do is actually figure out all of the transactions 
 > by comparing the chado xml with the chado database, which is what apollo already 
 > does, but I'm not sure as Im not so in touch with them these days (as Im not 
 > working with apollo these days - waiting for new grant to kick in).
 > 
 > Since the paradigm with chado xml is wipe out & reload, then apollo has to make 
 > sure it preserves every bit of the chado xml that came in. Theres a bunch of 
 > stuff thats in chado/chado xml that the apollo datamodel is unconcerned with, 
 > and has no need to be concerned with as its stuff that it doesnt visualize. In 
 > other words apollos data model is solely for apollos task of visualizing data, 
 > not for roundtripping what we call non-apollo data. In writing the chado xml 
 > adapter for FlyBase, Nomi Harris had a heck of a time with these issues, and she 
 > can elaborate on this I suppose.
 > 
 > I'm personally not fond of chado xml because its basically a relational database 
 > dump, so its extremely verbose. It redundantly has information for lots of joins 
 > to data in other tables - like a cvterm entry can take 10 or 20 lines of chado 
 > xml, and a given cvterm may be used a zillion times in a given chado xml file 
 > (as every feature has a cvterm). So these files can get rather large.
 > 
 > The solution for this verbose output is to use what I call macros in chado xml. 
 > Macros are supported by xort. They take the 15 line cvterm entry and reduce it 
 > to a line or 2 making the file size much more reasonable. The apollo chado xml 
 > adapter does not support macros, so you have to use unmacro'd chado xml for 
 > apollo purposes. Nomi Harris had a hard enough time getting the chado xml 
 > adapter working for flybase(and did a great job with a harrowing task), that she 
 > did not have time to take on the macro issue. If you wanted macros (and smaller 
 > file sizes) you would have to add this functionality to the chado xml adapter 
 > (are there java programmers in your group?).
 > 
 > One of the arguments against the jdbc adapter is that its dangerous because it 
 > goes straight into the database so if there are any bugs in the data adapter 
 > then the database could get corrupted - some groups find this a bit precarious. 
 > This is a valid argument. I think theres 2 solutions here. One is to thoroughly 
 > test the adapter out against a test database until you are confident that bugs 
 > are hammered out.
 > 
 > Another solution is to not go straight from apollo to the database. You can use 
 > an interim format and actually use apollo to get that interim format into the 
 > database. Of course one choice for interim format is chado xml and then you are 
 > at the the chado xml solution. The other choice for file format is GAME xml. You 
 > can then use apollo to load game into the chado database, and this can be done 
 > at the command line (with batching) so you dont have to bring up the gui to do 
 > it. Also chado xml can be loaded into chado via apollo as well (of course xort 
 > does this as well but not with transactions)
 > 
 > So then the question is if Im not going to go straight into the database, why 
 > would I choose game over chado xml?  Or if Im using chado xml should I use 
 > apollo or xort to load into chado. I think if you are using chado xml it makes 
 > sense to use xort as it is the tried & true technology for chado xml. The 
 > advantage of going through apollo is that it also uses the transactions from 
 > apollo (theres a transaction xml file) and thus writes back the edits in a 
 > transactional way as mentioned above rather than in a wipe out & reload fashion.
 > 
 > Also Game is a tried & true technology that has been used with apollo in 
 > production at flybase (before chado came along) for many years now. One 
 > criticism of it has been that DTD/XSD/schema has been a moving target, nor has 
 > it been described. That is not as true anymore. Nomi Harris has made a xsd for 
 > it as well as a rng. But I must confess that I have recently added the ability 
 > to have one level annotations in game (previously 1 levels had to be hacked as 3 
 > levels). Also game is a lot less verbose than un-macro'd chado xml, as it more 
 > or less fits with the apollo datamodel. One advantage of chado xml over game xml 
 > is that it is more flexible in terms of taking on features of arbitrary depth.
 > 
 > The chado xml adapter was developed for FlyBase and as far as I know has not 
 > been taken on by any other groups yet. Nomi can elaborate on this, but I think 
 > what this might mean is that there are places where things are FlyBase specific. 
 > If you went with chado xml the adapter would have to be generalized. Its a good 
 > exercise for the adapter to go through, but it will take a bit of work. Nomi can 
 > probably comment on how hard generalizing might be. I could be wrong about this 
 > but I think the current status with the chado xml adapter is that Harvard has 
 > done a bunch of testing on it but they havent put it into production yet.
 > 
 > The jdbc adapter is being used by several groups so has been forced to be 
 > generalized. One thing I have found is that chado databases vary all too much 
 > from mod to mod (ontologies change). There is a configuration file for the jdbc 
 > adapter that has settings for the differences that I encountered. I initially 
 > wrote it for cold spring harbors rice database that will be used in classrooms. 
 > Its working for rice in theory, but they havent actually used it much in the 
 > classroom yet. For rice the model is to save to game and use apollo command line 
 > to save game & transactions back to chado.
 > 
 > Cyril Pommier, at the INRA - URGI - Bioinformatique, has taken on the jdbc 
 > adapter for his group. I have cc'd him on this email as I think he will have a 
 > lot to say about the jdbc adapter. Cyril has uncovered many bugs and has fixed a 
 > lot of them (thank you cyril) as hes a very savvy java programmer. And he has 
 > also forced the adapter to generalize and brought about the evolution of the 
 > config file to adapt to chado differences. But as Cyril can attest (Cyril feel 
 > free to elaborate) it has been a lot of work to get jdbc working for him. There 
 > were a lot of bugs to fix that we both went after. Hopefully now its a bit more 
 > stable and the next db/mod wont have as many problems. I think Cyril is still at 
 > the test phase and hasn't gone into production (Cyril?)
 > 
 > Berkeley is using the jdbc adapter for an in house project. They are using the 
 > jdbc reader to load up game files (as the straight jdbc reader is slow as the 
 > chado db is rather slow) which are then loaded by a curator. They are saving 
 > game, and then I think chris mungall is xslting game to chado xml which is then 
 > saved with xort - or he is somehow writing game in another way - not actually 
 > sure. The Berkeley group drove the need for 1 level annotations(in jdbc,game,& 
 > apollo datmodel)
 > 
 > Jonathan Crabtree at TIGR wrote the jdbc read adapter, and they use it there. I 
 > believe they are intending to use the write adapter but dont yet do so (Jonathan?).
 > 
 > I should mention that reading jdbc straight from chado tends to be slow, as I 
 > find that chado is a slow database, at least for Berkeley. It really depends on 
 > the db vendor and the amount of data. TIGRs reading is actually really zippy. 
 > The workaround for slow chados is to dump game files that read in pretty fast.
 > 
 > In all fairness, you should probably email with FlyBase (& Chris Mungall) and 
 > get the pros of using chado xml & xort, which they can give a far better answer 
 > on than I.
 > 
 > Hope this helps,
 > Mark


From dalke at dalkescientific.com  Mon Mar 27 15:59:28 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 27 Mar 2006 13:59:28 -0700
Subject: [DAS2] cell phone battery dead
Message-ID: <3d9298aced5c4efb7d9c34574fcf7618@dalkescientific.com>

Sorry about the drop out towards the end of today's conversation.
The battery on my phone died.

					Andrew
					dalke at dalkescientific.com


From boconnor at ucla.edu  Wed Mar  1 21:34:38 2006
From: boconnor at ucla.edu (Brian O'Connor)
Date: Wed, 01 Mar 2006 13:34:38 -0800
Subject: [DAS2] Re: Re DAS2 Server
In-Reply-To: <C02B5657.A1E%vedupuganti@tgen.org>
References: <C02B5657.A1E%vedupuganti@tgen.org>
Message-ID: <4406136E.6060703@ucla.edu>

Hi Vidya,

So I think your best option is to try the RPM.  I built a Fedora Core 2 
RPM for DAS2 and just released it to http://biopackages.net last night.  
I could really use someone to test it so feedback would be great.  The 
RPM approach is nice because yum will take care of installing all the 
dependencies including the chado database.

If you're not using FC2 then it's a little but more involved.  We don't 
really have a lot of docs but I could update the README in cvs (see 
http://sourceforge.net/projects/gmod it's the "das2" module).   Until 
recently there wasn't  really an install process you just do a "perl 
Makefile.PL; make; make test" to run DAS2.  There's now an "install" 
target so you can do "perl Makefile.PL; make; sudo make install".  You 
need to set some environmental variables, install a chado DB, and make 
sure all the perl module dependencies are installed before you do this 
though.  See the Makefile.PL for the environmental variables you need to 
set.  I'll update the README to include information about the dependencies.

Hope this helps!  I cc'd Allen Day too, he might have some helpful hints...

--Brian

Vidya Edupuganti wrote:

>Hi Brian,
>I am trying to setup DAS/2 server so that it can be used with Affymetrix's
>IGB browser. I was trying to find a user manual for setting up DAS/2 server.
>I could not find any. Can you please direct me to a place where I can find
>it. If there isn't  any can you please give me some inputs on how to install
>a DAS/2 server and load data.
>I really appreciate your help,
>Thanks
>Vidya
>
>
>
>
>Vidyadari Edupuganti
>Bioinformatician, Bioinformatics Research Unit
>The Translational Genomics Research Unit (TGen)
>445 N. Fifth St
>Phoenix, AZ, 85004, USA
>
>
>
>  
>


From dalke at dalkescientific.com  Fri Mar  3 09:55:02 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 3 Mar 2006 02:55:02 -0700
Subject: [DAS2] working das validator
Message-ID: <44479892cb0e465913b82e02a5c2525c@dalkescientific.com>

I have a running validator at

http://cgi.biodas.org:8080/


I've only tested it with SOURCES document but there's little
that would fail with the others.

I had planned to get this up a couple days ago but I've been
distracted learning more about Javascript and a couple of Javascript
libraries.  I used Mochikit to make the interactivity you see
there, and I have some ideas about how to use Dojo -- but not
for a couple of weeks.

The code goes through the following validation steps:

  - TODO - handle if the URL is not fetchable and handle timeouts
  - check that the content-type agrees with the document type
  - check that it's well-formed XML; report error where not
  - check that the root element matches the document type
  - check that it passed the Relax-NG validation;
  - report the id and href fields which are empty strings
  - report if any date fields are not iso dates

There are many more checks I could add.  They are easy now
that the scaffold is there.

I'm going to work on the next draft now.

After that I'll get back to the validator.  I want to add
hyperlinks on fields which are links, and I have an idea of
how to add a "SEARCH" button next to the query urls which
creates a popup where you can fill in the different fields
before doing the search.

Budget-wise I'm not sure how to charge the last few days
of work as it was a "wouldn't it be neat if" project rather
than something really needed.  It is neat though ...

					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Fri Mar  3 17:34:11 2006
From: Steve_Chervitz at affymetrix.com (Chervitz, Steve)
Date: Fri, 3 Mar 2006 09:34:11 -0800
Subject: [DAS2] working das validator
In-Reply-To: <44479892cb0e465913b82e02a5c2525c@dalkescientific.com>
Message-ID: <C02DBE13.1CB21%Steve_Chervitz@affymetrix.com>

Andrew,

Nice work on the web interface to the validator. Before you dive back into
the spec, could you troubleshoot these 500 errors I'm getting on your
server?

URL: http://das.biopackages.net/das/genome

With the "guess" radio button I get:

   500 Internal error
   ....
   TypeError: GuessFromHeader() takes exactly 2 arguments (1 given)

With any other radio button I get:

   500 Internal error
   ....
   AttributeError: BodyError instance has no attribute 'args'

Steve

> From: Andrew Dalke <dalke at dalkescientific.com>
> Date: Fri, 3 Mar 2006 02:55:02 -0700
> To: DAS/2 <das2 at portal.open-bio.org>
> Subject: [DAS2] working das validator
> 
> I have a running validator at
> 
> http://cgi.biodas.org:8080/
> 
> 
> I've only tested it with SOURCES document but there's little
> that would fail with the others.
> 
> I had planned to get this up a couple days ago but I've been
> distracted learning more about Javascript and a couple of Javascript
> libraries.  I used Mochikit to make the interactivity you see
> there, and I have some ideas about how to use Dojo -- but not
> for a couple of weeks.
> 
> The code goes through the following validation steps:
> 
>   - TODO - handle if the URL is not fetchable and handle timeouts
>   - check that the content-type agrees with the document type
>   - check that it's well-formed XML; report error where not
>   - check that the root element matches the document type
>   - check that it passed the Relax-NG validation;
>   - report the id and href fields which are empty strings
>   - report if any date fields are not iso dates
> 
> There are many more checks I could add.  They are easy now
> that the scaffold is there.
> 
> I'm going to work on the next draft now.
> 
> After that I'll get back to the validator.  I want to add
> hyperlinks on fields which are links, and I have an idea of
> how to add a "SEARCH" button next to the query urls which
> creates a popup where you can fill in the different fields
> before doing the search.
> 
> Budget-wise I'm not sure how to charge the last few days
> of work as it was a "wouldn't it be neat if" project rather
> than something really needed.  It is neat though ...
> 
> Andrew
> dalke at dalkescientific.com
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2


From dalke at dalkescientific.com  Fri Mar  3 18:04:12 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 3 Mar 2006 11:04:12 -0700
Subject: [DAS2] working das validator
In-Reply-To: <C02DBE13.1CB21%Steve_Chervitz@affymetrix.com>
References: <C02DBE13.1CB21%Steve_Chervitz@affymetrix.com>
Message-ID: <5d7729f77f8d4b6dcbd8dacd04701c19@dalkescientific.com>

Hi Steve,

   I saw those errors in the log file but wasn't sure if they were
from you or Gregg.

> URL: http://das.biopackages.net/das/genome
>
> With the "guess" radio button I get:
>
>    500 Internal error
>    ....
>    TypeError: GuessFromHeader() takes exactly 2 arguments (1 given)

Fixed.

> With any other radio button I get:
>
>    500 Internal error
>    ....
>    AttributeError: BodyError instance has no attribute 'args'

Fixed.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Sun Mar  5 01:59:15 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sat, 4 Mar 2006 18:59:15 -0700
Subject: [DAS2] current text of draft 3 of spec
Message-ID: <5e3c38635022ba8ae291cd6c4e036eef@dalkescientific.com>

I've been working on the 3rd draft for the spec.  Because of
the confusion in the previous version I've decided on a
different approach where I jump into the middle and describe
how the parts fit together before getting into the details of
every element type or the theory behind the architecture.

I think this flows much better.

====================

DAS is a protocol for sharing biological data.  This version of the
specification, DAS 2.0, describes features located on the genomic
sequence.  Future versions will add support for sharing annotations of
protein sequences, expression data, 3D structures and ontologies.  The
genomic DAS interface is deliberately designed so there will be a
large core shared with the protein sequence DAS.

A DAS 2.0 annotation server provides feature information about one or
more genome sources.  Each source may have one or more versions.
Different versions are usually based on different assemblies.  As an
implementation detail an assembly and corresponding sequence data may
be distributed via a different machine, which is called the reference
server.

Annotations are located on the genomic sequence with a start and end
position.  The range may be specified multiple times if there are
alternate coordinate systems.  An annotation may contain multiple
non-continguous parts, making it the parent of those parts.  Some
parts may have more than one parent.  Annotations have a type based on
terms in SOFA (Sequence Ontology for Feature Annotation).  Stylesheets
contain a set of properties used to depict a given type.

Annotations can be searched by range, type, and a properties table
associated with each annotation.  These are called feature filters.

DAS 2.0 is implemented using a ReST architecture.  Each document (also
called an entity or object) has a name, which is a URL.  Fetching the
URL gets information about the document.  The DAS-specific documents
are all in XML.  Other data types have existing widely used formats,
and sometimes more than one for the same data.  A DAS server may
provide a distinct document for each of these formats, along with
information about which formats are available.

DAS 2.0 addresses some shortcomings of the DAS 1.x protocol, including:

  * Better support for hierachical structures (e.g. transcript + exons)

  * Ontology-based feature annotations

  * Allow multiple formats, including formats only appropriate for
    some feature types

  * A lock-based editing protocol for curational clients

  * An extensible namespacing system that allows annotations in
   non-genomic coordinates (e.g. uniprot protein coordinates or PDB
   structure coordinates)

=====

A DAS server supplies information about genomic sequence data sources.
The collection of all sources, each data source, and each version of a
data source are accessible through a URL.  All three classes of URLs
return a document of content-type 'application/x-das-sources+xml'
though likely with differing amounts of detail.  A 'versioned source'
request returns information only about a specific version of a data
source.  A 'source' request returns the list of all the versioned
source data for that source.  A 'sources' request returns the list of
all the source data, including all the versioned source data.

The URLs might not be distinct.  For example, a server with only one
version of one data source may use the same URL for all three
documents, and a server for a single organism may use the same URL for
the 'sources' and 'source' documents.

Most servers will list only the data sources provided by that server.
Some servers combine the sources documents from other servers into a
single document.  These registry servers act as a centralized index
and reduce configuration and network overhead.  A registry server uses
the same sources format as an annotation server.

Here is an example of a simple sources document which makes no
distinction between the three sources categories.


Request:

http://www.example.com/das/genome/yeast.xml

Response:

Content-Type: application/x-das-sources+xml

<?xml version="1.0" encoding="UTF-8"?>
<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00"
           xml:base="http://www.example.com/das/genome/">

   <SOURCE id="yeast.xml" title="Saccharomyces cerevisiae (Baker's  
yeast) genome"
          doc_href="http://www.example.com/yeast.html">
     <VERSION id="yeast.xml" created="2005-12-05">
       <COORDINATES taxid="4932" source="Gene_ID" authority="SGD32" />
       <CAPABILITY type="features" query_id="features.xml" />
       <CAPABILITY type="types" query_id="types.xml"/>
     </VERSION>
   </SOURCE>

</SOURCES>

All identifiers and href attributes in DAS documents follow the XML
Base specification (see http://www.w3.org/TR/xmlbase/ ) in resolving
partial identifiers and href attributes.  In this case the id
"yeast.xml" is fully resolved to
"http://www.example.com/das/genome/yeast.xml".


Here is an example of a more complicated sources document with
multiple organisms each with multiple versions.  Each of the two
source documents (one for each organism) has a distinct URL as does
each of the version for each organism.  This is a pure registry server
because the actual annotation data comes from other machines.

Request:
   http://www.biodas.org/known_servers

Response:

Content-Type: application/x-das-sources+xml

<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00">
   <SOURCE id="http://das.ensembl.org/das/SPICEDS/"  
title="das_vega_trans">
     <VERSION id="http://das.ensembl.org/das/SPICEDS/127/"  
created="2005-05-23">
       <MAINTAINER email="someone at sanger.ac.uk" />
       <COORDINATES taxid="7955" source="Chromosome" authority="ZV4"
                    test_range="BX255914" />
       <CAPABILITY types="segments"
               
query_id="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62">
       <CAPABILITY type="features"
            query_id="http://das.ensembl.org/das/SPICEDS/127/features" />
         <SUPPORTS name="das2queries" />
       </CAPABILITY>
       <CAPABILITY type="types"
            query_id="http://das.ensembl.org/das/SPICEDS/127/types" />
     </VERSION>

     <VERSION id="http://das.ensembl.org/das/SPICEDS/128/"  
created="2005-08-13">
       <MAINTAINER email="someone-else at sanger.ac.uk" />
       <COORDINATES taxid="7955" source="Chromosome" authority="ZV4"
                    test_range="BX255914" />
       <CAPABILITY type="segments"
               
query_id="http://www.ebi.ac.uk/das-srv/genome/zebrafish-62">
       <CAPABILITY type="features"
            query_id="http://das.ensembl.org/das/SPICEDS/128/features" />
         <SUPPORTS name="das2queries" />
       </CAPABILITY>
       <CAPABILITY type="types"
            query_id="http://das.ensembl.org/das/SPICEDS/128/types" />
       <CAPABILITY type="locks"  
url="http://das.ensembl.org/das/SPICEDS/128/locks" />
       <CAPABILITY type="writeback"
                 url="http://das.ensembl.org/das/SPICEDS/128/locks" />
     </VERSION>
   </SOURCE>

   <SOURCE id="http://www.example.com/das2/mus/sources.xml" title="Mus  
musculus">
     <VERSION id="http://www.example.com/das2/mus/42/sources.xml"  
created="2006-02-11">
       <MAINTAINER email="pied-piper at hamlet.ac.uk" />
       <COORDINATES taxid="10090" source="Clone" authority="Ensembl"
                 test_range="AL935121" />
       <CAPABILITY type="features"
             
query_id="http://www.example.com/cgi-bin/features-mus-v42.cgi">
         <SUPPORTS name="das2queries" />
       </CAPABILITY>
       <CAPABILITY type="types"
            query_id="http://www.example.com/das2/mus/v42/types.xml" />
     </VERSION>
   </SOURCE>
</SOURCES>

Each SOURCE id and VERSION id is individually fetchable so the URL
"http://das.ensembl.org/das/SPICEDS/" returns a sources document with
the SOURCE record for "das_vega_trans" and both of its VERSION
subelements while "http://das.ensembl.org/das/SPICEDS/128/" returns a
sources document with only the second of its VERSION subelements.

DAS documents refer to other documents through URLs.  There are no
restrictions on the internal form of the URLs, other than the query
string portion.  Server implementers are free to choose URLs which
best fit the architecture needs.  For example, a simple DAS server may
be implemented as a set of XML files hosted by a standard web server
while more complex servers with search support may be implemented as
CGI scripts or through embedded web server extensions.  The URLs do
not need to define a hierarchical structure nor even be on the same
machine. Compare this to the DAS1 specification where some URLs were
constructed by direct string modification of other URLs.

=====

Each versioned source contains a set of segments. A segment is the
largest chunk of contiguous sequence. For fully sequenced organisms a
segment may be a chromosome.  For partially assembled genomes where
the distance between the assembled regions is not known then each
region may be its own segment.  If a server provides annotations in
contig space then each contig is a segment.  Feature locations are
specified on ranges of segments which is why a specific set of
segments is called a coordinate system.  [coordinate-system] This
specification does not describe how to do alignments between different
coordinate systems.


The sources document format has two ways to describe the coordinate
system.  The optional COORDINATES element uniquely characterize the
coordinate system.  If two data sources have the same authority and
source values then they must be annotations on the same coordinate
system.  The specific coordinate system is also called the "reference
sequence".

A versioned source may contain CAPABILITY elements which describe
different ways to request additional data from a DAS server.  Each
CAPABILITY has a type that describes how to use the corresponding URL
to query a DAS server.  A CAPABILITY element of type "segments" has a
query URL which returns a document of content-type
"application/x-das-segments+xml".  A segments document lists
information about the segments in the coordinate system.  Here is an
example of a segments document.

Request:

http://www.biodas.org/das2/h.sapiens/v3/segments.xml

Response:

Content-Type: application/x-das-segments+xml

<?xml version="1.0" encoding="UTF-8"?>
<SEGMENTS xmlns="http://www.biodas.org/ns/das/genome/2.00">
  <SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr1.xml"
      name="Chr1" length="245522847"
      doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=1"/>
  <SEGMENT id="http://www.biodas.org/das2/h.sapiens/v37/segment/Chr2.xml"
      name="Chr2" length="243018229"
      doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=2"/>
</SEGMENTS>

=====

The versioned source record for an annotation server must include a
CAPABILITY of type "features".  A client may use the query URL from
the features CAPABILTY points to select features which match certain
criteria.  If no criteria are specified the server must return all
features unless there are too many features to return.  In that case
it must respond with an error message.

Unless an alternate format is specified, the response from the
features query is a document of content-type
"application/x-das-features+xml" containing all of the matching
features.  Here is an example features document for a server which
contains a gene and an alignment.

Request:

http://das.biopackages.net/das/genome/yeast/S228C/features.pl

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
           xml:base="http://www.example.org/volvox/1/">
  <FEATURE id="feature/cTel54X" type_id="type/gene" name="tg-3">
    <LOC segment="Chr2/1200:2917:1" />
  </FEATURE>

  <FEATURE id="feature/hit12"
           type_id="type/est-alignment"
           created="2001-12-15T22:43:36"
           modified="2004-09-26T21:10:15" >

    <LOC segment="Chr3/1201:1400:1" />
    <PART id="feature/hit12.hsp1" />
    <PART id="feature/hit12.hsp2" />
    <ALIGN target_id="feature/yk12391" range="200:299" />
    <PROP key="est2genomescore" value="180" />
  </FEATURE>

  <FEATURE id="feature/hit12.hsp1"
           type_id="type/est-alignment-hsp">
    <LOC segment="Chr3/1201:1250:-1" />
    <PARENT id="feature/hit12"/>
    <ALIGN target_id="feature/yk12391" range="1:52" gap="M49 D1 M1"/>
    <PROP  key="est2genomescore" value="180" />
  </FEATURE>

  <FEATURE id="feature/hit12.hsp2"
           type_id="type/est-alignment-hsp" >
    <LOC segment="Chr3/1351:1400:1" />
    <PARENT id="feature/hit12" />
    <ALIGN target_id="feature/yk12391" range="53:100" gap="M20 D1 G1  
M30" />
    <PROP  key="est2genomescore" value="120" />
  </FEATURE>

</FEATURES>

Each feature has a unique identifier and an identifer linking it to a
type record.  Both identifiers are URLs and should be directly
fetchable.  Simple features can be located on a region of a segment.
More complex features like a gapped alignment are represented through
a parent/part relationship.  A feature may have multiple parents and
multiple parts.

=====

An annotation server may contain many features while the client may
only be interested in a subset; most likely features in a given
portion of the reference sequence.  To help minimize the bandwidth
overhead the feature query URL should support the DAS feature filter
language.  The syntax uses the standard HTML form-urlencoded GET query
syntax.  For example, here is a request for all features on Chr2.

Request:

http://www.example.org/volvox/1/features.cgi?inside=Chr2

Response:

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
           xml:base="http://www.example.org/volvox/1/">
  <FEATURE id="feature/cTel54X" type_id="type/gene" name="tg-3">
    <LOC segment="Chr2/1200:2917:1" />
  </FEATURE>

  <FEATURE id="feature/hit12"
           type_id="type/est-alignment"
           created="2001-12-15T22:43:36"
           modified="2004-09-26T21:10:15" >

    <LOC segment="Chr3/1201:1400:1" />
    <PART id="feature/hit12.hsp1" />
    <PART id="feature/hit12.hsp2" />
    <ALIGN target_id="feature/yk12391" range="200:299" />
    <PROP key="est2genomescore" value="180" />
  </FEATURE>
</FEATURES>

and here is the rather long one for all EST alignments

Request:

http://www.example.org/volvox/1/features.cgi? 
type=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Ftype%2Fest-alignment

Response:

Content-Type: application/x-das-features+xml

<FEATURES xmlns="http://www.biodas.org/ns/das/genome/2.00"
           xml:base="http://www.example.org/volvox/1/">
  <FEATURE id="feature/hit12"
           type_id="type/est-alignment"
           created="2001-12-15T22:43:36"
           modified="2004-09-26T21:10:15" >

    <LOC segment="Chr3/1201:1400:1" />
    <PART id="feature/hit12.hsp1" />
    <PART id="feature/hit12.hsp2" />
    <ALIGN target_id="feature/yk12391" range="200:299" />
    <PROP key="est2genomescore" value="180" />
  </FEATURE>
</FEATURES>

=====

All features are linked to a type record.  DAS types do not describe a
formal type system in that DAS types do not derive from other DAS
types.  Instead it links to an external ontology term and describes
how to depict features of that type.

A DAS annotation server must contain a CAPABILITY element of type
"types".  A client may use its query URL to fetch a document of
content-type "application/x-das-types+xml". The document lists all of
the types available on the server.  We expect that servers will have
at most a few dozen types so DAS does not support type filters.

The following is a hypothetical example of a DAS annotation server
providing GENSCAN gene predictions for zebrafish.  Each feature is
either of type
"http://www.example.org/das/zebrafish/build19/high-type" or
"http://www.example.org/das/zebrafish/build19/low-type" depending on
if the data provider determined it was a high probability or low
probability prediction.  Even though there are two different type
records the refer to the same ontology term, in this case the SO term
for "gene".  The distinction exists so that the high probability
features are depicted differently from the low probability features.

Request:

http://www.example.org/das/zebrafish/build19/types

Response:

Content-Type: application/x-das-types+xml

<TYPES xmlns="http://www.biodas.org/ns/das/genome/2.00"
        xml:base="http://www.example.org/das/zebrafish/build19/">
   <TYPE id="high-type" title="High probability gene predictions"
        
doc_href="http://www.example.org/docs/genscan_prediction.html#high"
       source="GENSCAN 1.0"
        
ontology="http://song.sourceforge.net/XXX/does/not/exist/SO/0000704"
       accession="SO:0000704"
     <STYLE>
       <BOX fgcolor="red" border_width="1"/>
     </STYLE>
   </TYPE>
   <TYPE id="low-type" title="Low probability gene predictions"
       doc_href="http://www.example.org/docs/genscan_prediction.html#low"
       source="GENSCAN 1.0"
        
ontology="http://song.sourceforge.net/XXX/does/not/exist/SO/0000704"
       accession="SO:0000704"
     <STYLE>
       <BOX fgcolor="yellow" border_width="1"/>
     </STYLE>
   </TYPE>

</TYPES>


[coordinate-system]

We make a distinction between "coordinate system" and "numbering
system".  The coordinate system is the set of segment on which
features are located.  The numbering system describes how to identify
the specific residues in the segment.  DAS uses a 0-based coordinate
system where the first residue is numbered "0", the second "1", and so
on.  Other numbering systems include 1-based coordinates and the PDB
numbering system which preserves the residue number for key residues
across homologous family by allowing discontinuities, insertions and
negative values as position numbers.


					Andrew
					dalke at dalkescientific.com


From nomi at fruitfly.org  Mon Mar  6 08:09:22 2006
From: nomi at fruitfly.org (Nomi Harris)
Date: Mon, 6 Mar 2006 00:09:22 -0800 (PST)
Subject: [DAS2] DAS/2 teleconference?
Message-ID: <17419.60978.358549.246997@kinked.lbl.gov>

Is there a DAS/2 teleconference tomorrow morning?  Last week it didn't
happen.
        Nomi


From dalke at dalkescientific.com  Mon Mar  6 09:14:30 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 6 Mar 2006 02:14:30 -0700
Subject: [DAS2] DAS/2 teleconference?
In-Reply-To: <17419.60978.358549.246997@kinked.lbl.gov>
References: <17419.60978.358549.246997@kinked.lbl.gov>
Message-ID: <fd0f45acc7110c7d18f8c8a9f7fbe39d@dalkescientific.com>

Nomi:
> Is there a DAS/2 teleconference tomorrow morning?  Last week it didn't
> happen.

I plan on calling in.


					Andrew
					dalke at dalkescientific.com


From Gregg_Helt at affymetrix.com  Mon Mar  6 14:03:24 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 6 Mar 2006 06:03:24 -0800
Subject: [DAS2] DAS/2 teleconference?
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA0A@msex02.affymetrix.com>

Apologies for the mixup with the teleconference last week!  Yes we're
definitely on for a teleconference today at the standard time, 9:30 AM
Pacific time.

	Thanks,
	Gregg

> -----Original Message-----
> From: das2-bounces at portal.open-bio.org
[mailto:das2-bounces at portal.open-
> bio.org] On Behalf Of Nomi Harris
> Sent: Monday, March 06, 2006 12:09 AM
> To: DAS/2
> Subject: [DAS2] DAS/2 teleconference?
> 
> Is there a DAS/2 teleconference tomorrow morning?  Last week it didn't
> happen.
>         Nomi
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2


From lstein at cshl.edu  Mon Mar  6 14:49:18 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 6 Mar 2006 09:49:18 -0500
Subject: [DAS2] DAS/2 teleconference?
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA0A@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA0A@msex02.affymetrix.com>
Message-ID: <200603060949.19299.lstein@cshl.edu>

Hi Gregg,

I'll miss the first half hour of the call today because of an overlap with an 
NCI teleconference.

Lincoln

On Monday 06 March 2006 09:03, Helt,Gregg wrote:
> Apologies for the mixup with the teleconference last week!  Yes we're
> definitely on for a teleconference today at the standard time, 9:30 AM
> Pacific time.
>
> 	Thanks,
> 	Gregg
>
> > -----Original Message-----
> > From: das2-bounces at portal.open-bio.org
>
> [mailto:das2-bounces at portal.open-
>
> > bio.org] On Behalf Of Nomi Harris
> > Sent: Monday, March 06, 2006 12:09 AM
> > To: DAS/2
> > Subject: [DAS2] DAS/2 teleconference?
> >
> > Is there a DAS/2 teleconference tomorrow morning?  Last week it didn't
> > happen.
> >         Nomi
> >
> > _______________________________________________
> > DAS2 mailing list
> > DAS2 at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/das2
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu


From Gregg_Helt at affymetrix.com  Mon Mar  6 16:44:43 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 6 Mar 2006 08:44:43 -0800
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>

upcoming Code Sprint, March 13-17 at Affymetrix
status reports
 
coordinate system resolution via COORDINATES element
features with multiple locations vs. alignments
features with multiple parents
???
 

From lstein at cshl.edu  Mon Mar  6 17:37:39 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 6 Mar 2006 12:37:39 -0500
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
Message-ID: <200603061237.41288.lstein@cshl.edu>

Hi,

The teleconference system now asks me for a passcode. Previously I just had to 
enter the conference ID. What's up?

Lincoln

On Monday 06 March 2006 11:44, Helt,Gregg wrote:
> upcoming Code Sprint, March 13-17 at Affymetrix
> status reports
>
> coordinate system resolution via COORDINATES element
> features with multiple locations vs. alignments
> features with multiple parents
> ???
>
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu


From Gregg_Helt at affymetrix.com  Mon Mar  6 17:38:37 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 6 Mar 2006 09:38:37 -0800
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA0D@msex02.affymetrix.com>

Please try again, it shouldn't ask for a passcode, but if it does, it's
1365.  There may be some glitch in our teleconferencing...

	Thanks,
	Gregg

> -----Original Message-----
> From: Brian O'Connor [mailto:boconnor at ucla.edu]
> Sent: Monday, March 06, 2006 9:36 AM
> To: Helt,Gregg
> Cc: das2 at portal.open-bio.org
> Subject: Re: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
> 
> Hi Gregg,
> 
> I tried calling in to the DAS conference call but it asked for a
> passcode in addition to the conference ID.  All I have is the
conference
> ID...
> 
> --Brian
> 
> Helt,Gregg wrote:
> 
> >upcoming Code Sprint, March 13-17 at Affymetrix
> >status reports
> >
> >coordinate system resolution via COORDINATES element
> >features with multiple locations vs. alignments
> >features with multiple parents
> >???
> >
> >
> >_______________________________________________
> >DAS2 mailing list
> >DAS2 at portal.open-bio.org
> >http://portal.open-bio.org/mailman/listinfo/das2
> >
> >
> >


From nomi at fruitfly.org  Mon Mar  6 17:40:26 2006
From: nomi at fruitfly.org (Nomi Harris)
Date: Mon, 6 Mar 2006 09:40:26 -0800
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
Message-ID: <17420.29706.575212.913804@spongecake.lbl.gov>

i am calling in (800-531-3250, id: 2879055) but it is then asking me for
a passcode.  i tried entering 2879055 again but that didn't work.  we
didn't used to have a passcode, did we?  can someone tell me what it is?
if you prefer not to email it, you can phone me at 510 486-5078.
     Nomi


From Gregg_Helt at affymetrix.com  Mon Mar  6 18:10:23 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 6 Mar 2006 10:10:23 -0800
Subject: [DAS2] Examples of features with multiple locations from
	biopackages server
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA10@msex02.affymetrix.com>

In the teleconference today, we?re talking about features with multiple locations, here?s an example from biopackages server:
 
<FEATURE id="feature/Affymetrix_YG-S98:3589_f_at" type="type/SO:PCR_product" name="Affymetrix_YG-S98:3589_f_at">
<LOC id="segment/chrVII" range="114942:536087:1"/>
<LOC id="segment/chrVII" range="540819:876266:1"/>
<LOC id="segment/chrVII" range="561866:562768:-1"/>
<LOC id="segment/chrVII" range="567445:569664:-1"/>
<LOC id="segment/chrVII" range="567445:818311:-1"/>
<LOC id="segment/chrVII" range="816491:932008:1"/>
</FEATURE>
?
            <FEATURE id="feature/Affymetrix_YG-S98:3750_f_at" type="type/SO:PCR_product" name="Affymetrix_YG-S98:3750_f_at">
<LOC id="segment/chrVII" range="327915:402221:1"/>
<LOC id="segment/chrVII" range="561919:562068:-1"/>
<LOC id="segment/chrVII" range="811564:811720:1"/>
<LOC id="segment/chrVII" range="823052:823197:-1"/>
</FEATURE>


From boconnor at ucla.edu  Mon Mar  6 17:36:28 2006
From: boconnor at ucla.edu (Brian O'Connor)
Date: Mon, 06 Mar 2006 09:36:28 -0800
Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA0C@msex02.affymetrix.com>
Message-ID: <440C731C.5070303@ucla.edu>

Hi Gregg,

I tried calling in to the DAS conference call but it asked for a 
passcode in addition to the conference ID.  All I have is the conference 
ID...

--Brian

Helt,Gregg wrote:

>upcoming Code Sprint, March 13-17 at Affymetrix
>status reports
> 
>coordinate system resolution via COORDINATES element
>features with multiple locations vs. alignments
>features with multiple parents
>???
> 
>
>_______________________________________________
>DAS2 mailing list
>DAS2 at portal.open-bio.org
>http://portal.open-bio.org/mailman/listinfo/das2
>
>  
>


From dalke at dalkescientific.com  Mon Mar 13 14:00:45 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 06:00:45 -0800
Subject: [DAS2] format information for the reference server
Message-ID: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com>

(NOTE: the open-bio mailing lists were moved from portal.open-bio.org
to lists.open-bio.org.  My first email on this bounced because I
sent to the old email address.)

Summary of questions:
   - what does it mean for the annotation server to list the formats
       available from the reference server?
   - can the reference server format information be moved to the
       segments document?
   - are there formats which will only work at the segment level and
       not at the segments level (ie, formats which don't handle multiple
       records)?

Something's been bothering me about the segments request.

Currently the DAS sources request responds with something like

<SOURCES>
   <SOURCE>
    <VERSION>
      <CAPABILITY type="segments" query_url="http://blah/seq">
         <FORMAT name="fasta" />
         <FORMAT name="agp" />
      </CAPABILITY>
   ...
</SOURCES>

This says "go to 'blah' for information about the sequence".

But it says more than that.  It provides metadata about
the reference server.  It says that the reference server can
respond in 'fasta' and 'agp' formats.

Hence the following are allowed from this URL

   http://blah/seq?format=agp  -- return the assembly
   http://blah/seq?format=fasta -- return all sequences in FASTA format

Does this mean that all annotations servers using the given
reference server must list all of the available formats?

If a client sees multiple CAPABILITY elements for the same
query_url is it okay to merge the list of supported formats?
That is, if server X says that annotation server A supports
fasta and server Y says that A supports genbank then a client
may assume A supports both fasta and genbank formats?
(This makes sense to me.)

Second, does it make sense to require the annotation servers
to list the formats on the reference server?  What about
making that information available from the segments document,
like this.

query:

   http://www.biodas.org/das/h.sapiens/38/segments.cgi

response:

<SEGMENTS>
   <SEGMENT id="abc">
     <FORMAT name="fasta" />
     <FORMAT name="agp" />
   </SEGMENT>
   <SEGMENT id="def">
     <FORMAT name="fasta" />
     <FORMAT name="agp" />
   </SEGMENT>
</SEGMENT>

A problem with this the lack of data saying that the
segments query URL itself supports multiple formats.  For
example,

   http://www.biodas.org/das/h.sapiens/38/segments.cgi?format=fasta

might support returning all of the chromosomes in FASTA format.

Are there any formats which only work at the segment level
and not at the segments level?  That is, which only work with
single gene/chromosome/contig/etc. but don't support multiple
sequences?  The only one I could think of off-hand is "raw",
since there's no concept of a "record" given a bunch of letters,
unless the usual way is to separate them by an extra newline?

If all formats are supported for both single and all segments
then here is another possible response

[possibility #1]
<SEGMENTS>
   <FORMAT name="fasta" />
   <FORMAT name="agp" />
   <SEGMENT id="abc" />
   <SEGMENT id="def" />
</SEGMENT>

I think all formats which work on the "segments" level also
work on a single segment level, so another possibility is
the following, which lets a given segment say that it supports
more formats.

[possibility #2]
<SEGMENTS>
   <FORMAT name="fasta" />
   <FORMAT name="agp" />
   <SEGMENT id="abc">
     <FORMAT name="raw" />
   </SEGMENT>
   <SEGMENT id="def" />
     <FORMAT name="raw" />
   </SEGMENT>
</SEGMENT>


Here's another, using a flag to say if a format is for a
single segment, the segments URL, or both (feel free to
pick better names!). By default it applies to both.

[possibility #3]

<SEGMENTS>
   <!-- both support FASTA retrieval -->
   <FORMAT name="fasta" />

   <!-- both support GenBank retrieval -->
   <FORMAT name="genbank" applies-to="both" />

   <!-- can only get the assembly of everything -->
   <FORMAT name="agp" applies-to="segments" />

   <!-- can only get the raw sequence for a segment -->
   <FORMAT name="raw" applies-to="segment" />
</SEGMENT>

Yet another option is

[possibility #4]
<SEGMENTS>
   <FORMATS-FOR-SEGMENTS>
     <FORMAT name="fasta" />
     <FORMAT name="genbank" />
     <FORMAT name="agp" />
   </FORMATS-FOR-SEGMENTS/>
   <FORMATS-FOR-SINGLE-SEGMENT>
     <FORMAT name="fasta" />
     <FORMAT name="genbank" />
     <FORMAT name="raw" />
   </FORMATS-FOR-SEGMENTS/>
   ..

Of these I support [possibility #1], with the ability to go
to [possibility #3] if there's ever a case where a given format
cannot be applied to both levels.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Mon Mar 13 14:29:28 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 06:29:28 -0800
Subject: [DAS2] id, url, uri, and iri
Message-ID: <cd2e10567561e030affb493ad1d55aba@dalkescientific.com>

Something to settle.

I've been using 'id' like this

>  <FEATURE id = "feature/hit12"
>           type_id = "type/est-alignment"
>           created = "2001-12-15T22:43:36"
>           modified = "2004-09-26T21:10:15" >
>
>    <LOC id="residues/Chr3" range="1201:1400:1" />
>    <PART id="feature/hit12.hsp1" />
>    <PART id="feature/hit12.hsp2" />
>    <ALIGN target_id="feature/yk12391" range="200:299" />

As Dave Howorth pointed out, most people use 'id' as an
in-document identifier, and not as an identifier to link
to other documents.  Eg, there's a "getElementById()" method
in the DOM which is mean to find DOM nodes given the id.

In looking around I found that it's keyed off of the type
(as determined by the schema) and not by the string 'id'.
I added 'xml:id' as a possible DAS attribute, which is defined
by the XML spec to work as expected for getElementById.

In private email Gregg asked about using 'uri' instead of
'id' for this.

I'm now leaning that way.  Either 'uri' or 'url' or 'iri'.
I prefer url because everyone knows what that means.  Gregg
prefers 'uri' I think because that's what allows fragment
identifiers, and because it includes things which are other
than URLs, like LSIDs.

However, the latest thing these days is an "iri" which
means "internationalized resource identifier"
   http://www.ietf.org/rfc/rfc3987.txt

I haven't read enough of it to understand it.  My first
attempt says that it's okay to use "uri" because there
are 1-to-1 mappings between uris and iris.  Also, I don't
want to test bidirectional text and I suspect there isn't yet
widely used library support for iris.

So I want to change the DAS use of 'id' to 'url' and say
"the value of the 'url' attribute is a URI".


					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Mon Mar 13 15:38:58 2006
From: Steve_Chervitz at affymetrix.com (Chervitz, Steve)
Date: Mon, 13 Mar 2006 07:38:58 -0800
Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 6 Mar 2006
Message-ID: <C03AD212.1CEBE%Steve_Chervitz@affymetrix.com>

[These are notes from last week's meeting. -Steve]

Notes from the weekly DAS/2 teleconference, 6 Mar 2006

$Id: das2-teleconf-2006-03-06.txt,v 1.1 2006/03/13 15:41:03 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  CSHL: Lincoln Stein
  Sanger: Thomas Down
  Dalke Scientific: Andrew Dalke
  UC Berkeley: Nomi Harris
  UCLA: Brian O'Connor
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


Agenda:
-------

upcoming Code Sprint, March 13-17 at Affymetrix
status reports
coordinate system resolution via COORDINATES element
features with multiple locations vs. alignments
features with multiple parents
???

[ Some trouble with passcode for teleconf - hopefully fixed ]

TD: The coord syst things we were hoping to discuss with Andreas who
won't make it today.
GH: We can push this off till next week.

Code Sprint
-------------
LS: At sanger mon-tues for ensembl sab meeting, able to participate
from tues pm to fri eve.
AD: Planning to come to Affy
BO: Allen and I are planning to come up to Emeryville
GH: For payment, submit expenses to affy.
Hotels? Marriott or Woodfin. Will send out rec's today.
NH: Planning to attend at affy mon-tues, thur.

[A] Ed will look into accts for andrew and brian (internet access)

GH: Plan on 9-10am phone teleconf daily. Greg can pick up people from
hotel. 

GH: Goals/deliverables for this code sprint?
LS: Write das/2 client for bioperl. Plan to plug into Gbrowse
All I need is a working server
AD: Writing writeback and locks, improving validator .
NH: Apollo and registry, feature types. Wrote a writer, can test in
AD's validator (plan to).
GH: Keep working on das/2 client for igb at affy. Hoping by then to
have an affy das/2 server up and running.
SC: Can help get it up
GH: Can we put on in our dmz, so it's publically accessible at least
for the code sprint.

[A] Steve will look into setting up publically accessible affy das/2 test
server 

TD: Working on getting an Ensembl das/2 server up.
GH: Java middle ware on top of biojava?
TD: Yes. Using the biojava to ensembl bridges.
EE: Getting IGB to use style sheets.
AD: And/or using a proper style sheet system, if you decide what I put
in there is not good enough.
BO: Looking for something to do. Hoping to start on writeback.,
Helping separate out igb model layer. Finished rpm packages in last
code sprint, this is pretty much done.
GH: Guess Allen will be working on the biopackages server.
BO: Waiting on spec for writeback.
AD: My writeup specifies how they do writeback at Sanger, overlaps
well with Lincoln's proposal. See that.

GH: Need to tighten up the read-only spec. A fair number of things to
resolve.
AD: A partial draft of 3rd version. Planning to update it before next
sprint. Examples so people can get a feel for how things go together.

GH: My agenda stuff: coord system resolution system to match
annotations on same genome coming from diff servers.

[A] Gregg will wait for Andreas to join in before discussing coordinate
issues.

GH: Feats w/ multiple locations (see email Gregg sent to the list
today with examples). Current spec says if you use >1 coord
system, you can have feats with multiple locations. Is this what we
want to say? 
GH: Allen's server has feats w/ >1 location on same coord system. Do
we want to allow or disallow? If disallow, how?
AD: Possible usecase for alignments.
GH: Feat model for bioperl. Locations have multiple parts. Feats with
mult locations feels similar to that. Do you have multple children
each with a loc, or do you use the align element?
LS: Prefers children. That's what SO ended up doing after much
arguing. Makes it easier.
GH: Enforce it with the ontology. E.g, an alignment hit has alignment
hsps. 
This forces client to understand the ontology.
LS: Consider that an hsp will have scores attached to it, different
cigar line. So you  end up with mult children anyway. An improverished
type of alignment. Can use cigar line to indicate mismatches. Can have
a single HSP and a cigar line to indicate gaps. Only one child. You
don't have to have multiple locations
GH: Looking for use case of multiple locations with PCR products...
My main concern is how much semantic knowledge the clients need to
understand these things. Nothing in the spec that restricts mult
locations.
AD: Won't client just get the multiple children and not care about
types?
GH: I gues a simple client could do that. It disturbs me that it's up
the server how to handle multip loc, childrent, vs aligmnets.
Will send an example.
LS: Yes. this is a vague area. There should be a best-practices
section in the spec.
Single match feature from begin to end. HSP children, each one covers
major gaps. Cigar line w/in hsp to cover minor gaps. Can give each hsp
an alignment score.
GH: Main diff between locn and alignment is cigar string, and cigar
string is optional.
If we're allowed to use locations to designate alignments...
LS: How about if we consolidate location and alignment: location has
an optional cigar and then do away with alignment. Generalize
location to allow for gaps.
TD: Example: Aligning an est to the genome. Falls into several blocks
of exact/near exact matching. If location has cigar line, could serve
it up as a single feature.
GH: You can do this since cigar can represent arbitrary length gaps.
TD: Neat and compact way to do it. Does this scare anyone?
GH: Sounds reasonable.
AD: Let's do it. And will put in examples of best practices.

[A] Consolidate location and alignment in spec, loc has optional cigar

GH: Feats with mult parents. Need examples to test. This is a question
to people putting up servers. Will anyone have these?
TD: Ensembl might do this. Exon shared between several transcripts. A
toss up between multiple parents vs. multiple copies of same
exon. Think mult parents is the way to do it.
LS: Flybase use multiple parents for exons in this way.
TD: Ensemble db is a many-to-many between transcripts and exons.

GH: Spec says: If you have a child in the feat document, you have to
include its parent; If you have a parent you must include it's
children. As long as this plays policy nice with that requirement, I'm
ok with it. 
GH: Anyone else see things that need to be ironed out in spec?
AD: Not yet

NH: We should write a paper about das/2. This will help get more
people using it, increase the success of the spec.
GH: Agreed -- good idea. We have lots of text in grant about the
philosophy of das/2.
NH: Can pull text from these places. Publish at a conference perhaps?
ISMB, CSB2006
GH: PLoS Bioinformatics?
NH: Conference would be nice, to involve people in discussion.
AD: Poster session is available for ISMB.
NH: Prefers a conference talk. Paper will require more finished
stable. Poster is too much work for little payoff.
AD: Ann L complains that the only paper to cite for das is an old
ref. Wants an updatable citable paper.
NH: CSB will publish a proceedings.
Genome informatics at CSHL (they don't publish though).
NH/GH: What's the best conference to get published in these days?
LS: ISMB
NH: We missed deadline for it.
LS: Biocurators meeting?
NH: Can ask Sima about. Another one: Computational Genomics (TIGR
sponsored). Not published. Submit abstracts, they select
talks. Halloween in Baltimore. If conf proceedings are published, you
can't submit to a paper, so we could go that way, get double mileage
out of it.
GH: Sounds good to get something ready for a paper rather than a
conference. Did a presentation at Bosc, Genome informatics last year.

[A] Nomi will help get paper ready for PLoS (after code sprint)

AD: Can do poster for ismb, bosc in Brazil, if I end up going.
NH: ISMB deadline is 10 May, so we should get going on it

GH: Continuation grant submission, in theory has been reviewed, but
haven't heard back. Maybe will take another month, to get score back.
Final word?
LS: Have you checked ERA Commons? They may update it there before you
get the note.


From dalke at dalkescientific.com  Mon Mar 13 15:58:29 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 07:58:29 -0800
Subject: [DAS2] definition of coordinate system attributes
In-Reply-To: <e4fac3477606826ca6dbc80a11555d0a@sanger.ac.uk>
References: <3124ef2656aa51af817f16b1b71b16a2@sanger.ac.uk>
	<e5f47011c9258aecb1e4d21291de4aac@dalkescientific.com>
	<b4ce0b58feae2b49a77191059da213c9@sanger.ac.uk>
	<c52c87ec4f95cd9186d59ba463e5b76e@dalkescientific.com>
	<e4fac3477606826ca6dbc80a11555d0a@sanger.ac.uk>
Message-ID: <e5ea1776e51e938f6dc72ea26ee37f26@dalkescientific.com>

I've been exchanging emails with Andreas

>> Me?  I don't know what it's for.  Which means I've wiped it.
>
> is this a spec change? then I need to update the source response form 
> the new devel  dasregistry ...
>
> actually the new_spec.txt says it has not been changed since feb. 
> 10th...

I had hoped to have an updated spec by now.  (After all, the conf.
call is in an hour.)  That didn't happen.  :(

I've attached what I have so far.  I'll be working on it more today,
and getting things in CVS updated.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: draft3.txt
URL: <http://lists.open-bio.org/pipermail/das2/attachments/20060313/9e1e7b22/attachment-0001.txt>
-------------- next part --------------


					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Mon Mar 13 16:47:32 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Mon, 13 Mar 2006 16:47:32 +0000
Subject: [DAS2] format information for the reference server
In-Reply-To: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com>
References: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com>
Message-ID: <e36fa2d7ad8152a2122357a01a3f4e03@sanger.ac.uk>


On 13 Mar 2006, at 14:00, Andrew Dalke wrote:

> Summary of questions:
>    - what does it mean for the annotation server to list the formats
>        available from the reference server?

should this happen? I thought that annotation servers are described by 
their "coordinate system"
then the registry provides a list of available references servers that 
provide the sequences for this.

> Something's been bothering me about the segments request.
>
> Currently the DAS sources request responds with something like
>
> <SOURCES>
>    <SOURCE>
>     <VERSION>
>       <CAPABILITY type="segments" query_url="http://blah/seq">
>          <FORMAT name="fasta" />
>          <FORMAT name="agp" />
>       </CAPABILITY>
>    ...
> </SOURCES>
>
> This says "go to 'blah' for information about the sequence".
>
> But it says more than that.  It provides metadata about
> the reference server.  It says that the reference server can
> respond in 'fasta' and 'agp' formats.

I think an annotation server should not know/provide this information
this should come from the reference server / registry


> If a client sees multiple CAPABILITY elements for the same
> query_url is it okay to merge the list of supported formats?

that does not sound clean.

> That is, if server X says that annotation server A supports
> fasta and server Y says that A supports genbank then a client
> may assume A supports both fasta and genbank formats?
> (This makes sense to me.)

the client should ask the reference server directly what it speaks /
rely on the registration server to have validated that server A speaks
indeed what  it says it does.

Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From Gregg_Helt at affymetrix.com  Mon Mar 13 17:13:14 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 13 Mar 2006 09:13:14 -0800
Subject: [DAS2] DAS/2 code sprint conference starting now
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA2A@msex02.affymetrix.com>

We just started the daily DAS/2 code sprint teleconference at
Affymetrix.
US number #: 800-531-3250
International #: 303-928-2693
Conference ID: 2879055
Passcode: 1365
 
 
From Gregg_Helt at affymetrix.com  Mon Mar 13 20:48:50 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 13 Mar 2006 12:48:50 -0800
Subject: [DAS2] Problem with name feature filter on biopackages server
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>

I'm looking into adding the ability in the IGB DAS/2 client to retrieve
features by name/id.  Trying this out with the biopackages server almost
gives me what I want:
 
http://das.biopackages.net/das/genome/yeast/S228C/feature?name=YGL076C
 
except that in the returned XML the parent feature (YGL076C) does not
list it's children as <PARTS>, though the children list YGL076C as
<PARENT>.  Any ideas?
 
            thanks!
            gregg
 

From nomi at fruitfly.org  Mon Mar 13 22:32:49 2006
From: nomi at fruitfly.org (Nomi Harris)
Date: Mon, 13 Mar 2006 14:32:49 -0800 (PST)
Subject: [DAS2] Where to publish [was Re: Notes from the weekly DAS/2
	teleconference, 6 Mar 2006]
In-Reply-To: <C03AD212.1CEBE%Steve_Chervitz@affymetrix.com>
References: <C03AD212.1CEBE%Steve_Chervitz@affymetrix.com>
Message-ID: <17429.62225.230884.764469@kinked.lbl.gov>

On 13 March 2006, Chervitz, Steve wrote:
 > NH/GH: What's the best conference to get published in these days?
 > LS: ISMB
 > NH: We missed deadline for it.
 > LS: Biocurators meeting?
 > NH: Can ask Sima about.

Sima said:
> Next biocurator meeting is probably in early 2007 in the UK. No plans at
> the moment to publish the proceedings, however.
> 
> I think publishing soon in PLoS is a good idea.


From dalke at dalkescientific.com  Mon Mar 13 23:45:04 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 15:45:04 -0800
Subject: [DAS2] URIs for sequence identifiers
Message-ID: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>

Proposals:
   - do not use segment "name" as an identifier
       - rename it "title" (human readable only)
       - allow a new optional "alias-of" attribute which is the
            link to the primary identifier for this segment

   - change the feature location to use the segment uri

   - change the feature filter range searches so there is a new "segment"
      keyword and so the "includes", "overlaps", etc. only work on
      the given segment, as
         segment=<uri>
         inside=$start:$stop
         overlaps=$start:$stop
         contains=$start:$stop
         identical=$start:$stop

   - If 'includes', 'overlaps', etc. are given then the 'segment'
       must be given (do we need this restriction?  It doesn't make
        sense to me to ask for "annotations on 1000 to 2000 of anything"

   - only allow at most one each of includes, overlaps,
       contains, or identical (do we need this restriction?)

   - multiple segments may be given, but then range searches
       are not supported (do we need this restriction?)

Discussion:

The discussion on this side of things was based on today's phone
conference.  Andreas needs data sources to work on multiple
coordinate spaces.

To quote from Andreas:
> There are several servers that understand more than one coordinate
> system and can return the same type of data in different coordinates.  
> (depending on which type of accession code/range was used for the
> request ) E.g. there are a couple of zebrafish servers that speak
> both  in Chromosome and Scaffold coordinates. (reason perhaps
> being that zebrafish is an organism that seems to be very difficult
> to assemble ?)

The current DAS system does not support this because of how
it does segment identifiers.

The current scheme looks like this:

<!-- sources.xml -->
<SOURCES ...>
   <SOURCE ...>
    <VERSION ...>
      <COORDINATES authority="Andreas" source="Scaffold" ... />
      <COORDINATES authority="Andreas" source="Chromosome" ... />
      <CAPABILITY type="segments" query_id="http://sanger/andreas/" />
      ....

Problem #1: We need two entry points, one to view the segments
in Scaffold space, the other to view them in Chromosome space.

Solution #1 (don't like it though).
Add a "source=" attribute to the CAPABILITY and allow multiple
segments capabilities

<!-- sources.xml -->
<SOURCES ...>
  <SOURCE ...>
   <VERSION ...>
    <COORDINATES authority="Andreas" source="Scaffold" ... />
    <COORDINATES authority="Andreas" source="Chromosome" ... />
    <CAPABILITY type="segments"
       query_id="http://sanger/andreas/scaffolds.xml" source="Scaffold"  
/>
    <CAPABILITY type="segments"
       query_id="http://sanger/andreas/chromosomes.xml"  
source="Chromosome" />
     ....


I don't like it because it feels like the COORDINATES and
CAPABILITY[type="segments"] field should be merged.  Still, I'll
go with it for now.

Problem #2: feature searches return features from either namespace

Consider search for name=*ABC* (that is, "ABC" as a substring in
the "name" or "alias" fields).  Then the result might be

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment="A/100:200" />
   </FEATURE>
</FEATURES>

Where "A" is a short-hand notation for one of the segments?
Which one?  The client goes to the segment servers:

Query: http://sanger/andreas/scaffolds.xml"
Response:
<SEGMENTS>
  <SEGMENT id="http://whatever.com/ChromosomeA" name="A" length="2000" />
</SEGMENTS>

Query: http://sanger/andreas/chromosomes.xml"
<SEGMENTS>
  <SEGMENT id="http://whatever.com/ScaffoldA" name="A" length="2000" />
</SEGMENTS>

The segment name "A" matches either ChromosomeA or ScaffoldA, and
there's no way to figure out which is correct!


This comes because our own naming scheme is not very good at
being globally unique.  We could fix it by also stating the
namespace in the result, as

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment="A/100:200" source="Scaffold"/>
   </FEATURE>
</FEATURES>

Gregg asked "why don't we just use the URI"?

After a long discussion we decided to propose just that.
That is, get rid of the "name" attribute.  Instead, use a
"title" attribute which is human readable and an optional
"alias-of" which contains is the primary identifier for
the given segment.

The alias-of value is determined by the person who
defined the COORDINATES.  It could be a URL.  It could
a URI.  It does not need to be resolvable (though it
should - perhaps to a human readable document?  Or to
something which lists all known aliases to it?)

That is, the segments document will look like this

Query: http://sanger/andreas/scaffolds.xml"
Response:
<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ChromosomeA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Chromosome/A"
     title="Chromosome A" />
</SEGMENTS>

Query: http://sanger/andreas/chromosomes.xml"
<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ScaffoldA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Scaffold/A"
     title="Scaffold A" />
</SEGMENTS>

This has a few implications.  Feature locations must be given
with respect to the segment uri, as

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment_uri="http://whatever.com/ScaffoldA" range="200:300"/>
   </FEATURE>
</FEATURES>

Given this segment_uri a client can figure out if it is in
Scaffold or Chromosome space because it can check all of the
URIs in each space for a match.


The other change is in range searches.  Consider the current
scheme, which looks like

   includes=ChrA
   includes=A/100:300

The query is of the form $ID or $ID/$start:$end.  It needs to be
changed to support URLs.  For examples,

   includes={http://www.whatever.com/ChromosomeA
   includes={http://www.whatever.com/ScaffoldA}/100:300

We couldn't come up with a better syntax.  Then Gregg asked
"why do we need multiple includes"?

That is, the current syntax supports
   includes=ChrA/0:1000;includes=ChrB/2000:3000;includes=ChrC/5000:6000

to mean "anywhere on the first 1000 bases of ChrA, the 3rd 1000
bases of ChrB, or the 6th 1000 bases of ChrC".

Given the query language, we're looking for way to write that
using URLs, as

    
includes={http://www.whatever.com/ChromosomeA}0:1000;includes={http:// 
www.whatever.com/ChromosomeB}:2000:3000;includes={http:// 
www.whatever.com/ChromosomeC}:5000:6000;

However, that's a very unlikely query.  What if we split the
"includes", "overlaps", etc. into "includes_segment" and  
"includes_range".
In that case:

   old-style:
includes=A/500:600
   new-style:
includes_segment=http://www.whatever.com/ChromosomeA; 
includes_range=500:600

   old-style:
includes=A/500:600,Chr3/700:800
   new-style:
includes_segment=http://www.whatever.com/ChromosomeA; 
includes_range=500:600;
includes_range=700:800

   old-style:
includes=A/500:600,D/700:800
   new-style: -- NOT POSSIBLE

   old-style:
includes=A/500:600,D/500:600
   new-style: (not likely to be used in real life)
includes_segment=http://www.whatever.com/ChromosomeA; 
includes_segment=http://www.whatever.com/ChromosomeD; 
includes_range=500:600;

This no longer allows searches with subranges from different segments.

The again -- who cares?  Those sorts of searches are strange.

Talking some more.  Who needs the ability to do more than one
includes / overlaps / etc. query at a time?  Gregg wants the
ability to do a combination of includes and overlaps, but
that's all.

We can simplify the server code by only supporting one
inside search, one contains search, and/or one overlaps
search, instead of the current system which allows a more
constructive geometry, and we can move the segment id out
into its own parameter.

Allen said that that would prevent more complicated types
of analysis on the server, but that anyone doing more
complicated searches would pull the data down locally.

Does anyone want to do more than one overlaps search at
at time?  More than one contains search at a time?  More
than one identical search at a time?

(For that matter, does anyone actually want to do a "identical"
search?  Gregg thinks it will be useful to find any other
annotations which are exactly matching the given range.
I think that might be better with a "include"/"exclude" combination
to have start/end positions within a couple of bases from
the specified range.)

PROPOSAL:
   Change the range query language to have

segment=  <<the url of the segment to search>
inside= $start:$end
overlaps= $start:$end
contains= $start:$end

Example:

segment=http://whatever.com/ChromosomeD;inside=5000:6000

Also, only allow at most one includes, one overlaps, and
one contains (unless people want it).  I'm less sure about
the need for this restriction.  It might be as easy to
implement the more complex search as it would be to check
for the error cases.


					Andrew
					dalke at dalkescientific.com


From ed_erwin at affymetrix.com  Mon Mar 13 23:56:56 2006
From: ed_erwin at affymetrix.com (Ed Erwin)
Date: Mon, 13 Mar 2006 15:56:56 -0800
Subject: [DAS2] URIs for sequence identifiers
In-Reply-To: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>
References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>
Message-ID: <441606C8.3070902@affymetrix.com>


Andrew Dalke wrote:

>>There are several servers that understand more than one coordinate
>>system and can return the same type of data in different coordinates.  
>>(depending on which type of accession code/range was used for the
>>request ) E.g. there are a couple of zebrafish servers that speak
>>both  in Chromosome and Scaffold coordinates. (reason perhaps
>>being that zebrafish is an organism that seems to be very difficult
>>to assemble ?)
> 
> 
> The current DAS system does not support this because of how
> it does segment identifiers.
> 
> 
> Problem #1: We need two entry points, one to view the segments
> in Scaffold space, the other to view them in Chromosome space.
> 
> Solution #1 (don't like it though).
> Add a "source=" attribute to the CAPABILITY and allow multiple
> segments capabilities


> Problem #2: feature searches return features from either namespace
> 

A different solution:

Scaffold and Chromosome coordinate systems are served by separate DAS/2 
servers.  Each server returns data from one and only one namespace.

Those separate servers can, behind-the-scenes, use the same database.

DAS/2 clients, like IGB, would choose to connect to either the 
Scaffold-based server or the Chromosome-based server, but not usually to 
both at once.

Does this handle all the issues?

Ed


From dalke at dalkescientific.com  Tue Mar 14 00:12:52 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 13 Mar 2006 16:12:52 -0800
Subject: [DAS2] URIs for sequence identifiers
In-Reply-To: <441606C8.3070902@affymetrix.com>
References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>
	<441606C8.3070902@affymetrix.com>
Message-ID: <54829d8554d9b044908965d80b158c60@dalkescientific.com>

Ed:
>> Problem #2: feature searches return features from either namespace
>
> A different solution:
>
> Scaffold and Chromosome coordinate systems are served by separate 
> DAS/2 servers.  Each server returns data from one and only one 
> namespace.
>
> Those separate servers can, behind-the-scenes, use the same database.
>
> DAS/2 clients, like IGB, would choose to connect to either the 
> Scaffold-based server or the Chromosome-based server, but not usually 
> to both at once.
>
> Does this handle all the issues?

Here's the email I got from Andreas when I proposed this.


>>> There may be more than one COORDINATE element if ... (XXX why?)
>
> There are several servers that understand more than one coordinate 
> system and
> can return the same type of data in different coordinates. (depending 
> on which type of accession code/range was used for the request )
> E.g. there are a couple of zebrafish servers that speak both  in 
> Chromosome and Scaffold coordinates.
> (reason perhaps  being that zebrafish is an organism that seems to be 
> very difficult to assemble ?)


>> Will there be separate CAPABILITY items for each source?
>
> no. if there are then this should be registered as two independent 
> servers.

(but see clarification later)

> Allowing multiple coordinate systems per server is a way to slightly 
> reduce the already long list of known
> servers. Currently there are about 90 in the registry (+10 in the last 
> few weeks...) and there still are about 20 more
>  which have not been registered (and are provided by the BioSapiens 
> project).

>> Long for who?  It isn't that much data.
>
> It is long for somebody who browses manually through the ensembl DAS 
> configuration and searches for a DAS source to add to.
> It a "long" list for myself to read through the DAS server list at
> http://das.sanger.ac.uk/registry/listServices.jsp
> and although I know this list pretty well,  it seems to me a lot of 
> text/descriptions, etc.


>> There is only one reference server for an annotation server.
>
> I think it should be one reference server per coordinate system.


>> But if there are two COORDINATES elements, and you say that
>> each has its own reference server, then aren't you saying that
>> a single annotation server may have multiple reference servers?
>
> yes. i believe that this should be possible.

>>  What's the concern about having
>> no more than one coordinate per data source?
>
> Just last friday somebody asked me how to add a DAS server that has 
> two coordinate systems to different Ensembl views ( ContigView and 
> GeneView)
> Her initial solution was to provide multiple DAS sources
> http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_211
> and
> http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_219
>
> but I think this could be joint into a single server.


In any case, I think the proposal I outlined in the previous email
makes things cleaner even without support for multiple coordinate
systems on the same server.

					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Tue Mar 14 04:22:36 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Mon, 13 Mar 2006 20:22:36 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day one, 13 Mar 2006
Message-ID: <C03B850C.1CF48%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day one, 13 Mar 2006

$Id: das2-teleconf-2006-03-13.txt,v 1.1 2006/03/14 04:31:36 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Andreas Prlic
  Dalke Scientific: Andrew Dalke (at Affy)
  UC Berkeley: Nomi Harris (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


General note: 
Passcode is now required to enter teleconf.
This is a change in their system.


Issue: Continuation Grant
-------------------------

gh: no word yet.


Issue: Coordinate System
------------------------

ad: question of what happens when there are multiple coordinate
systems for an assembly.

auth and source,
source: contig space, scaffold space
auth: organization (e.g. ncbi, ucsc)

gh: not enough to get uniqueness.
ncbi, genome, human is not enough, need version
to uniquely id the coord system

ad: auth, source, species, version identification string
gh: use case: need to know whether uris for two versioned source
refer to the same genome.

gh: ncbi version numbers are separate from organism info, eg. v35.

ad: we could have a service for mapping strings

gh: idea - every server can say this assembly name is same as
that. Clients could chain together statements from multiple servers.
For the affy das server used by igb, we now have a synonyms file on
our server which igb reads. It's a pain to maintain.

ad: type of alignment server?
gh: a synonym server. Here's a uri, give me a list of synonyms that
refer to the same thing.

This is something tho talk more about when Andreas is on line.

[Andreas joins in.]

GH: How would a das server verify the version info in a sources
document point to same genome assembly?
AP: You would check auth=ncbi, vers=35, taxid=human
AP: In protein structure space, you check verison on every object you
work with. Protein seq.
gh: so we have to map version info on sequences as well as genome
assemblies. 
gh: use case: two segment responses from diff servers, diff uris for
the diff sequences, how you know they are refering to the same seq.
name=chromosome21 vs name=chr21?
ad: we require the same name for the same segments.
gh: going to fall apart fast. no way to enforce it. People use 1, I,
chr1, chromI.
ee: can put this in the validation suite.
aday: yes.
gh: but what do you use for name: accession # for entry, string chr1,
etc.
gh: important since this is the name that goes to user.
ad: could have one slot for computer to use, one for human consumption.
ad: for segments there seem to be two diff ids: url,
ad: the point of having special ids for segments is segment
equivalence from different servers. Separate coordinates element that
says how to merge things together. Identifiers in here that are just
coordinate space ids, not necessarily for human use. Only for identifying
coords.
gh: but how do we get people to use it?

sc: what about the idea of using checksums as identifiers for a seq?
ad: problem of duplicate seqs in an assembly. eg., same seq from chr1
and chr9.
gh: if they are the same seq they should get the same id.
ad: don't you want to know if there is a region on chr1 that is an
exact duplicate of a region on chr9?
sc: we could create the checksum on source:sequence

gh: useful to have a central place to ask for diff names for the same
coord system.
ad: uniqueness idea: coords element, has: auth, source, version,
species (optional) 
uniqueness says these are the names you use.
gh: this can fail. What do we say happens when it fails? Should there
be a way of resolving it.
ad: this is where your synonym table comes it. Publish it?
gh: maybe as part of the registry, knows

ap: there isn't a big variety in naming because there aren't many
people providing assemblies.
gh: we already have 10 different synonyms for an assembly
ee: this has some performance impact on igb. should have to do it.
ap: we should say this is how naming works.
gh: will fail.

ad: is this required for this version of the spec?
gh: need something that can be used now.
aday: without hardwiring
gh: if we don't agree during the code sprint, then it won't happen for
everyone else.
aday: using roman numerals for yeast since sgd uses it.
ee: trouble with chrX

ad: andreas: is there a place for naming of segments to use
ap: no, something for the reference server, not coords
ad: given these coords, here are the names that are used.
ap: same as reference server.

gh: maybe registry should provide: here's a coord system and here are
the names you can use for
ap: you would get a long list for proteins
aday: a user who wants to

gh: question for brian g: LSID, when you come across this for LSIDs,
ncbi is auth for human genome assembly yet they have no lsid for their
assembly, how do people refer to their lsid when there's no authority
to say what it is?
bg: you can't, no one is the authority. but you can write a resolver
that queries ncbi under the cover, in your resolver you make ncbi the
authority of the lsid, add namespace, object id. Then everyone has to
know that your resolver is hosted at some site somewhere. So there is
no satisfactory answer. It's a problem if the authority does not host
the resolver.
bg: I'm at the w3c meeting at mit, providing a webified resolver, they
would host a resolver, everyone would know to go to a well-known web
address. 

bg: you start a convention, enforce it, give error if people don't
use it.
gh: thinking we need it associated with registry.
ap: ref server + coord system, provides ids that can be used,
gh: so other ids can be used, but registry server wouldn't support it.

ad: site has ftp site for downloading chromosomes, contains names for
different segments in the file. How do I go from the ids in ths file
to the ids that Andreas describes.
To make my annotations in the same space. Mapping from file from ncbi.
bg: what are your use cases? write back to server?
ad: user publishing locally,
bg: you make a ref server.
gh: experience from das1 is that everyone makes their own reference
server and refers to it from their annotation server, using different
names. 
ad: new tag 'coordinates'
gh: like enforcing common names at registry server. Can use their own
names, they just won't be allowed to post on the registry.

ad: need documentation
ap: could point to docn on reference server

bg: workflow1: fish researcher looking for abberant regions in chr7,
11 and 3, singled out the abctransporter gene. How does that work in
das/2? type 'abc' in web page for reference server? This is a gene name.
ad: your client browser can go to to registry to find servers that
host the assemblies for your fish. Go to those reference servers, do
searches there. Will go to coord system, get a segments document, get
display chromosome by title.
gh: get a das features xml document saying the sequence and
coordinates.
gh: our discussion here is on getting the diff.
ad: we don't have anything on coordinates saying which is the latest
version.

bg: latest build may have changed their gene coordinate.
gh: mapping servers is part of our continuation grant. Can push an
annotation on one assembly to another assembly.
bg: a hard thing.
gh: that's why where enlisting UCSC to do it!

ad: Topic: id, url, uri, iri (see email)
gh: likes uri, not url. Some things aren't really urls
(resolvable). Iri might work.

ad: multiple coord elements for same ref server.
ap: originally there was one, but some use two, zebrafish guy chrom
and scaffold coordinates. or chromosomes vs. gene ids. same types,
different accession codes and features.
ad: if you have graphical browser, do you get scaffolds or
chromosomes.
ap: depends on your view.
gh: if you do a segments query, do you get segments and contigs?
ap: depending on the coordinate system of the requrest.
ad: one capabilities for scaffolds and one for chromosomes?
gh: maybe

Deliverables:
[A] gregg: by end of week, load stuff from multiple servers, compare in the
same view.

[A] steve will work on getting gregg's das/2 server up and running.

gh: trouble with biopackages.net server
aday: possible power outage interference.

gh: target filters have been  dropped.
aday: yay!


From dalke at dalkescientific.com  Tue Mar 14 15:14:44 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 07:14:44 -0800
Subject: [DAS2] use cases
Message-ID: <8bc46502eb164882394a3f4acbe08987@dalkescientific.com>

I think these cover the basic use cases.  Let me know if there
are other reasonable ones I should add.

Use Case #1

Biologist viewing genomic region wants to add information
from server www.biodas.org/das2/ .

Example of use:
   - Go to "open DAS server" option.  Type/paste URL for
DAS server.
   + DAS viewer connects to server, verifies that it
annotates the same sequence source and has under (say)
10 types so it makes a new track for each type and does
a request for all the features in the current display.


Use Case #2

Biologist wants all lac repressors on build 12 of mouse.

Example of use:
   - Start DAS viewer.  Go to "find server" option.  Select
"mouse" from the list of "model organisms".  Select "build 12"
from a pull-down menu of build descriptions.  Select all
the listed servers.
   - Go to "find annotations" option

Now what?  Is "lac repressor" a name?  Is it a combination
of a name and ontology term?  Is it a pure ontology term?


Use Case #3

Biologist wants to find all the annotation servers for the most
recent build of H. sapiens.

Example of use:
  - Start DAS viewer.  Go to "find server" option.  Type "human"
(or "H. sapiens" or "Homo sapiens").  Search.
  + DAS viewer consults internal NCBI taxonomy table to get taxid.
DAS viewer displays all matches.
  - Sort by build date, select all matching servers by hand

Problem:
   DAS has no field to search by build date


Use Case #4

Bioinformaticist wants to make annotations available for
build v32 of human.

Example of use:
   - Go to registry server to get a human-readable description
of the COORDINATES fields for build v32.
   - decide to point people to a reference server instead of
providing local sequence data
   - create the sources, types and features document
   - put them on a web server
   - go to registry and submit site for future inclusion


Use Case #5

IT wants people to use local mirrors of reference
server when possible.

Example of use:
   - set up a local registry server
   + server connects to Andreas' registry server and downloads
all the data
   + server rewrites "segments" sections to use local server
   - configure all DAS viewers to consult local registry server


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Tue Mar 14 15:13:44 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 07:13:44 -0800
Subject: [DAS2] using 'uri' instead of 'id'
Message-ID: <9779f55861a4e800d0d21ec8d96deb8c@dalkescientific.com>

Okay, I'm convinced.

Where things in the spec use 'id' they will now use 'uri'.

There are going to be a few wide-spread but shallow
changes because of this.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Tue Mar 14 16:09:12 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 08:09:12 -0800
Subject: [DAS2] segments and coordinates
Message-ID: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com>

Summary:  I want to
   - move the COORDINATE element inside of the
         CAPABILITY[type="segments"] element

   - add a 'created' timestamp to the COORDINATE (for sorting by time)

   - add a unique 'uri' identifier attribute to the COORDINATE
      (two coordinates are equal if and only if they have the same id)

   - have that identifier be resolvable, to get information about
       the coordinate system (but perhaps leave the contents for a
       future spec)

In writing the documentation I've been struggling with
COORDINATES.  No surprise there.

The current spec has COORDINATES and the "segments" capability
as different elements, like

<COORDINATES source="Chromosome" authority="NCBI" version="v22"
        taxid="9606" created="2006-03-14T07:27:49" />
<CAPABILITY type="segments"
     query_id="http://localhost/das2/h.sapiens/v22/segments" />

(Note the 'created' timestamp to sort a list of coordinates
by the time it was established.)

With the current discussion on multiple coordinates, it
looks like there is a 1-to-1 relationship between a COORDIANTES
record and a CAPABILITY record.  As that's the case I want
to merge them together, as in (note change from "_id" to "_uri")


<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES source="Chromosome" authority="NCBI" version="v22"
          taxid="9606" created="2006-03-14T07:27:49" />
</CAPABILITY>

In talking with Andreas I think he agrees that this makes sense.


Second, there's a question of identity.  When are two coordinates
the same?  Is it when they have the same
   (authority, source, version)
the same
   (authority, source, version, taxid)

Since taxid is optional, what if one server leaves it out;
are the two still the same?

I decided to solve it with a unique identifier.  Two
COORDINATES are the same if and only if they have the
same identifier.  That identifier just happens to be
a URI.  It does not need to be resolvable (but should
be, with the results viewable at least for humans).

Let's say that
   http://das.sanger.ac.uk/registry/coordinates/ABC123
is the identifier for:
   authority=NCBI
   version=v22
   taxid=9606
   source=Chromosome
   created=2006-03-14T07:27:49

Then the following are equivalent.  The only difference is the
number of properties defined in the COORDINATES tag.

<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES 
uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" />
</CAPABILITY>


<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
       source="Chromosome"/>
</CAPABILITY>


<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
      source="Chromosome" authority="NCBI" version="v22" taxid="9606"
      created="2006-03-14T07:27:49" />
</CAPABILITY>


In theory these extra values don't need to be in the COORDINATES
tag.  They are knowable given the uri.  But that requires a
discovery mechanism for the properties (eg, the COORDINATES identifier
might need to be retrievable, with some format or other).

There is the possibility of value mismatch, but as Andreas pointed
out the registry server can do that validation pretty easily.


I mentioned property discovery earlier.  Given a coordinates URI
there are three things you might want to know:
   - what is the full list of coordinate system properties?
   - what is the authoritative reference server for the coordinates?
   - are there alternate reference servers?

What if that was resolvable (doesn't need to be defined for DAS,
so this is hypothetical) into something like

<COORDINATE-SYSTEM doc_href="something for humans to read">
   <!-- definitive information about this coordinate system -->
   <COORDINATES uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
       source="Chromosome" authority="NCBI" version="v22" taxid="9606"
       created="2006-03-14T07:27:49" />
   <SEGMENT-SERVER uri="http://whatever/" is-authoritative="yes" />
   <SEGMENT-SERVER uri="http://mirror1/"/>
   <SEGMENT-SERVER uri="http://mirror2/"/>
   <SEGMENT-SERVER uri="http://mirror3/"/>
</COORDINATE-SYSTEM>

(Hmmm, those are some ugly names.  I usually shy away from '-'s
in element and attribute names.)


OR, what if the authoritative URL also implemented the segments
interface, and we added a COORDINATES element to it?  Errr, I
don't like that.  We will be in charge of the coordinate
system URIs but we won't be in charge of the primary reference
server.

Use Case #6.

NCBI releases a new human build.  Ensembl releases annotations
for it and wants to put the information with Andreas' registry.

Example of use:
   - Set up an Ensembl reference server and annotation server
        for the new build; test it out
   - Create a new coordinate system record on the registry
      - fill in the species, source, doc_href, etc. fields
      - when finished the result is a URL, tied to coordinate info
   - Stick the COORDINATES information in the versioned
        source record
   - Tell the registry server to register the given versioned
        source URL


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Tue Mar 14 16:21:54 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 08:21:54 -0800
Subject: [DAS2] today's sprint meeting
Message-ID: <b16e929488b3a77c0771fa9297984cff@dalkescientific.com>

Gregg can't make it this morning and asked that I let today's
meeting.  Here are the things I would like to talk about:

== segment identifier.

Quoting from my email yesterday

   - do not use segment "name" as an identifier
       - rename it "title" (human readable only)
       - allow a new optional "alias-of" attribute which is the
            link to the primary identifier for this segment

<SEGMENTS>
  <SEGMENT uri="http://whatever.com/ChromosomeA" length="2000"
     alias-of="http://www.ncbi.nlm.nih.gov/human/v32/Chromosome/A"
     title="Chromosome A" />
</SEGMENTS>


   - change the feature location to use the segment uri

<FEATURES>
   <FEATURE id="F0001" type_id="T0001">
     <LOC segment_uri="http://whatever.com/ScaffoldA" range="200:300"/>
   </FEATURE>
</FEATURES>

   - change the feature filter range searches so there is a new "segment"
      keyword and so the "includes", "overlaps", etc. only work on
      the given segment, as
         segment=<uri>
         inside=$start:$stop
         overlaps=$start:$stop
         contains=$start:$stop
         identical=$start:$stop

http://biodas.org/feature.cgi?segment=http://whatever.com/ChromosomeD; 
inside=5000:6000
(with URL escaping rules for the query string that's
       
...feature.cgi? 
segment=http%3A%2F%2Fwhatever.com%2FChromosomeD&inside=5000%3A6000

   - If 'includes', 'overlaps', etc. are given then the 'segment'
       must be given (do we need this restriction?  It doesn't make
        sense to me to ask for "annotations on 1000 to 2000 of anything"

   - only allow at most one each of includes, overlaps,
       contains, or identical (do we need this restriction?  Then again,  
Gregg
       only needs a single includes and a single overlaps; perhaps make  
this
       even more restrictive?)

   - multiple segments may be given, but then range searches
       are not supported (do we need this restriction?)

Consensus on this side seems to be fine.  The biggest worry is the
increasing use of URIs in URL query strings.


== coordinate systems

Quoting from an email I wrote recently

   - move the COORDINATE element inside of the
         CAPABILITY[type="segments"] element

   - add a 'created' timestamp to the COORDINATE (for sorting by time)

   - add a unique 'uri' identifier attribute to the COORDINATE
      (two coordinates are equal if and only if they have the same id)

Result looks like

<CAPABILITY type="segments"
      query_uri="http://localhost/das2/h.sapiens/v22/segments">
   <COORDINATES uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
      source="Chromosome" authority="NCBI" version="v22" taxid="9606"
      created="2006-03-14T07:27:49" />
</CAPABILITY>


   - have that identifier be resolvable, to get information about
       the coordinate system (but perhaps leave the contents for a
       future spec)

== use 'uri' instead of 'id' in the spec

I've decided to go with 'uri' instead of 'id' (or 'url' or 'iri')
in its various uses in the spec.

== churn

My feeling is this is the last major churn.  I'm not able to keep
up with the documentation writing, which makes it hard for people
to get things done.

Should I work with people today on getting data sources working
and developing example data files for people to review?  That is,
examples which show and explain the various element in the spec?
I figure more people work from example than from spec description.

					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Tue Mar 14 16:35:07 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue, 14 Mar 2006 16:35:07 +0000
Subject: [DAS2] URIs for sequence identifiers
In-Reply-To: <441606C8.3070902@affymetrix.com>
References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>
	<441606C8.3070902@affymetrix.com>
Message-ID: <0cd005042c73d6080c568576a08bb987@sanger.ac.uk>

>
> A different solution:
>
> Scaffold and Chromosome coordinate systems are served by separate DAS/2
> servers.  Each server returns data from one and only one namespace.
>
> Those separate servers can, behind-the-scenes, use the same database.
>
> DAS/2 clients, like IGB, would choose to connect to either the
> Scaffold-based server or the Chromosome-based server, but not usually  
> to
> both at once.
>
> Does this handle all the issues?


Hm I see this as a possibility but what about the following:


<SOURCES>
<SOURCE id="DS2_1" title="yeastdas2server1"  
doc_href="http://cvs.biodas.org/cgi-bin/viewcvs/viewcvs.cgi/das/das2/ 
new_spec.txt?rev=1.6&cvsroot=biodas&content-type=text/vnd.viewcvs- 
markup">
?
	<VERSION id="latest" created="2006-02-08">
	<MAINTAINER email="allenday at ucla.edu"/>
?
	<COORDINATES taxid="4932" source="Chromosome" authority="SGD"  
test_range="chrVII/364251:366080">
	<VERSION name="32"/>
	</COORDINATES>
	<CAPABILITY type="segments"  
query_id="http://das.biopackages.net/das/genome/yeast/S228C/segment"/>
	<CAPABILITY type="types"  
query_id="http://das.biopackages.net/das/genome/yeast/S228C/type"/>
	</VERSION>
</SOURCE>

<SOURCE id="DS2_2" title="yeastdas2server2"  
doc_href="http://cvs.biodas.org/cgi-bin/viewcvs/viewcvs.cgi/das/das2/ 
new_spec.txt?rev=1.6&cvsroot=biodas&content-type=text/vnd.viewcvs- 
markup">
?
	<VERSION id="latest" created="2006-02-08">
	<MAINTAINER email="allenday at ucla.edu"/>
?
	<COORDINATES taxid="4932" source="Gene_ID" authority="SGD"  
test_range="ydr409w">
	<VERSION name="32"/>
	</COORDINATES>
	<CAPABILITY type="segments"  
query_id="http://das.biopackages.net/das/genome/yeast/S228C/segment"/>
	<CAPABILITY type="types"  
query_id="http://das.biopackages.net/das/genome/yeast/S228C/type"/>
	</VERSION>
</SOURCE>
</SOURCES>


This would be how to write one server which has two coordinate systems.  
according to the "one coord sys/server" rule.
I think it would be shorter to provide two coordinates sections for  
that and only one source description...


--- fyi, a yeast by Gene_ID server is  e.g.
http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_169


Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From ap3 at sanger.ac.uk  Tue Mar 14 16:48:09 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue, 14 Mar 2006 16:48:09 +0000
Subject: [DAS2] segments and coordinates
In-Reply-To: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com>
References: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com>
Message-ID: <ad94ac7cea5c8a07117d4785879b8a79@sanger.ac.uk>


On 14 Mar 2006, at 16:09, Andrew Dalke wrote:

> Summary:  I want to
>    - move the COORDINATE element inside of the
>          CAPABILITY[type="segments"] element

Is this really needed?

> The current spec has COORDINATES and the "segments" capability
> as different elements, like
>
> <COORDINATES source="Chromosome" authority="NCBI" version="v22"
>         taxid="9606" created="2006-03-14T07:27:49" />
> <CAPABILITY type="segments"
>      query_id="http://localhost/das2/h.sapiens/v22/segments" />


> With the current discussion on multiple coordinates, it
> looks like there is a 1-to-1 relationship between a COORDIANTES
> record and a CAPABILITY record.  As that's the case I want
> to merge them together, as in (note change from "_id" to "_uri")

I think hat this is a many to many relationship.
Do you still want to provide the link to the reference server from an 
annotation server?
This is not needed because the coordinates describe the reference 
server sufficiently.

Annotation servers do not need the segments capability - only the 
features capability.


> <CAPABILITY type="segments"
>       query_uri="http://localhost/das2/h.sapiens/v22/segments">
>    <COORDINATES source="Chromosome" authority="NCBI" version="v22"
>           taxid="9606" created="2006-03-14T07:27:49" />
> </CAPABILITY>
>
> In talking with Andreas I think he agrees that this makes sense.

If you really *want* to have the link back from the annotation server 
to the reference then
I would propose to put capability under coordinates - i.e. the other 
way round.


> econd, there's a question of identity.  When are two coordinates
> the same?  Is it when they have the same
>    (authority, source, version)
> the same
>    (authority, source, version, taxid)

yes

>
> Since taxid is optional, what if one server leaves it out;
> are the two still the same?

no - because if a taxid is specified that is a restriction for one 
organism. no taxid means that  this refers to multiple organisms.


> I decided to solve it with a unique identifier.

that might be good. this identifier could also be used to restrict 
searches on servers with many coordinate systems.

>
> Let's say that
>    http://das.sanger.ac.uk/registry/coordinates/ABC123
> is the identifier for:
>    authority=NCBI
>    version=v22
>    taxid=9606
>    source=Chromosome
>    created=2006-03-14T07:27:49

fine


> Then the following are equivalent.  The only difference is the
> number of properties defined in the COORDINATES tag.
>
> <CAPABILITY type="segments"
>       query_uri="http://localhost/das2/h.sapiens/v22/segments">
>    <COORDINATES
> uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" />
> </CAPABILITY>
>
>
> <CAPABILITY type="segments"
>       query_uri="http://localhost/das2/h.sapiens/v22/segments">
>    <COORDINATES 
> uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
>        source="Chromosome"/>
> </CAPABILITY>
>
>
> <CAPABILITY type="segments"
>       query_uri="http://localhost/das2/h.sapiens/v22/segments">
>    <COORDINATES 
> uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
>       source="Chromosome" authority="NCBI" version="v22" taxid="9606"
>       created="2006-03-14T07:27:49" />
> </CAPABILITY>

o.k.

This is a lot of change to the spec for us being already on the second 
code sprint,
but I think it makes things clearer

Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From dalke at dalkescientific.com  Tue Mar 14 20:46:27 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 12:46:27 -0800
Subject: [DAS2] description and title
Message-ID: <84c508c1625b5507dd511c8d1ef0f682@dalkescientific.com>

Andreas' DAS registry has a description for each versioned source.
See http://das.sanger.ac.uk/registry/listServices.jsp .

Here's an example of what's in it

     Machine learning approach used SWISSPROT variants annotated as
     disease/neutral as training dataset. Predictions made on all
     ENSEMBL nscSNPs as to their disease status

I've added an optional 'description' field to the versioned source
record for servers that wish to provide that information.


Allen's types response had 'name' and 'description' attributes.
These were not in the types record.  I've added 'description' and
added 'title'.

I've been using 'title' for short descriptions; a few words long.
I've been using 'description' for plain text up to a paragraph.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 00:34:55 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 14 Mar 2006 16:34:55 -0800
Subject: [DAS2] updated examples
Message-ID: <dd454f0a2558990438f7b58e779406b0@dalkescientific.com>

Checked into das CVS.

das/das2/draft3/

The current (incomplete) spec is 'spec.txt'.  It is already out of date.
The .rnc files are up-to-date.

The subdirectory "ucla" contains data from Allen's server,
with the format hand-updated.

A couple of things to note.  I used three different ways of
specifying the same namespace:

<SOURCES xmlns="http://www.biodas.org/ns/das/genome/2.00">

<DAS2:FEATURES xmlns:DAS2="http://www.biodas.org/ns/das/genome/2.00">

<das:SEGMENTS xmlns:das="http://www.biodas.org/ns/das/genome/2.00">


This is to check that you all are doing correct namespace processing.  
:)

Also, I've gone ahead and added the 'SUPPORTS' element, like this

       <CAPABILITY type="features" query_uri="yeast/features.xml">
         <SUPPORTS name="basic" />
       </CAPABILITY>

This says that the server only supports 'basic' searches, which means
you can only ask it for all the feature.  There is no feature query 
language.
There is also 'das2queries' which says that the server supports the
das2 query language.  The following says that you can ask for everything
or you can ask for things in the DAS2 query language.

       <CAPABILITY type="features" query_uri="yeast/features.xml">
         <SUPPORTS name="basic" />
         <SUPPORTS name="das2queries" />
       </CAPABILITY>

If not given the client should assume it supports 'das2queries'.
Note that 'basic' is a subset of 'das2queries'.


					Andrew
					dalke at dalkescientific.com


From lstein at cshl.edu  Wed Mar 15 10:46:41 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 10:46:41 +0000
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
Message-ID: <200603151046.43196.lstein@cshl.edu>

Hi Folks,

I just ran through the source request on biopackages.net and it is returning 
something that is very different from the current spec (CVS updated as of 
this morning UK time). I understand why there is a discrepancy, but for the 
purposes of the code sprint, should I code to what the spec says or to what 
biopackages.net returns? It is much more fun for me to code to a working 
server because I have the opportunity to watch my code run.

Best,

Lincoln

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From lstein at cshl.edu  Wed Mar 15 10:39:35 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 10:39:35 +0000
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
Message-ID: <200603151039.36405.lstein@cshl.edu>

Hi Folks,

Shouldn't the prefix to das2 requests be http://server/blahblah/das2   ?

It would make it easier for clients to load the correct parsing code and would 
avoid the client having to make a round-trip to the server just to determine 
whether it is dealing with a das/1 or das/2 server.

My apologies if this has already been discussed.

Lincoln

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From dalke at dalkescientific.com  Wed Mar 15 14:32:26 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 06:32:26 -0800
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <200603151046.43196.lstein@cshl.edu>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151046.43196.lstein@cshl.edu>
Message-ID: <4d86b8f899632c8cd506297938fffd8a@dalkescientific.com>

Lincoln:
> I just ran through the source request on biopackages.net and it is  
> returning
> something that is very different from the current spec (CVS updated as  
> of
> this morning UK time).

The server isn't synched with any specific version of the spec. For
example, if I make a features request from

http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/ 
0:1000")

I get

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DAS2FEATURE SYSTEM  
"http://www.biodas.org/dtd/das2feature.dtd">
<FEATURELIST
   xmlns="http://www.biodas.org/ns/das/2.00"
   xmlns:xlink="http://www.w3.org/1999/xlink"
   xml:base="http://das.biopackages.net/das/genome/yeast/S228C/feature">
</FEATURELIST>

As from the discussion a few weeks ago we shouldn't be using the
   standalone="no"
since that says the document cannot be understood without consulting
the DTD, which doesn't exist.  And I don't want a DTD.

Also, the namespace needs to be  
"http://www.biodas.org/ns/das/genome/2.00"
(It's missing the 'genome') and the 'FEATURELIST' was replaced with
'FEATURES' a year ago.

In the types request

<?xml version="1.0" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="/xsl/das-genome-type.xsl"?>
<!DOCTYPE DAS2TYPES SYSTEM "http://www.biodas.org/dtd/das2types.dtd">
<!--
      xmlns="http://www.biodas.org/ns/das/genome/2.00"
-->
<TYPES
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xml:base="http://das.biopackages.net/das/genome/yeast/S228C/type/">
   <TYPE id="SO:ARS" ontology="/das/ontology/obo/1/ontology/SO/0000436"  
name="ARS" definition="A sequence that can autonomously replicate, as a  
plasmid, when transformed into a bacterial host.">


the commented out namespace declaration needs to there, and the type
id 'SO:ARS' needs to be escaped as it's treated as an identifier  
resolved
with the "SO" protocol.  Plus, until yesterday I didn't know about the
'name' or 'definition' attributes.  These are now in the schema as
'title' and 'description'.

There are a few other differences, like problems in the taxid and
empty strings for timestamps.  I hand-updated examples from Allen's
server yesterday, in cvs under das/das2/draft3/ucla .  I found some
of these during the update, though others I pointed out about a
year ago.

Allen doesn't want to update the server until the spec is stable,
for two reasons.  First, he doesn't like the churn of doing work only
to have to make more changes.  Second, you're not the only one who says

>  It is much more fun for me to code to a working
> server because I have the opportunity to watch my code run.

and Allen's setup doesn't have the ability to implement two versions
at the same time.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 14:46:39 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 06:46:39 -0800
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <200603151039.36405.lstein@cshl.edu>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
Message-ID: <d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>

> Shouldn't the prefix to das2 requests be http://server/blahblah/das2   
> ?
>
> It would make it easier for clients to load the correct parsing code 
> and would
> avoid the client having to make a round-trip to the server just to 
> determine
> whether it is dealing with a das/1 or das/2 server.

It doesn't need the round-trip.  It can look at the Content-Type to
figure that out.

Plus, few of the DAS1 servers follow the DAS1 naming scheme.  Here's
a list from Andreas' registry server.

genome.cbs.dtu.dk:9000/das/tmhmm/
genome.cbs.dtu.dk:9000/das/netoglyc/
das.ensembl.org/das/ens_sc1_ygpm/
atgc.lirmm.fr/cgi-bin/das/MethDB/
smart.embl.de/smart/das/smart/
supfam.org/SUPERFAMILY/cgi-bin/das/up/
mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/

All of them do have the substring '/das/' somewhere, but not
at the start/end of the string.

Now, the content-type might be "application/xml" and not sufficient
to disambiguate between the two documents, but in that case you can
dispatch based on the root element type.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 15:05:52 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:05:52 -0800
Subject: [DAS2] XML namespaces
Message-ID: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>

I mentioned this yesterday but am doing it again as its own email.
This is a quick tutorial on XML namespaces.

The DAS spec uses XML namespaces.  XML didn't start with namespaces.
They were added later.  Older parsers, like SAX 1.0, did not understand
namespaces.  Newer ones, like SAX 2.0, do.

By default a document does not have a namespace.  For example,

<person name="Andrew" />

has no namespace.

To declare a default namespace use the 'xmlns' attribute.  All
attributes which start 'xml' or are in the 'xml:' namespace are
reserved.

<person name="Andrew" xmlns="http://www.biodas.org/" />

This is the name 'person' in the namespace 'http://www.biodas.org/'.
The namespace is an opaque identifer.  It leverages URIs in part
because it's much easier to guarantee uniqueness.

The combination of (namespace, tag name) is unique.  The tag
name is also called the "local name".

That's to distinguish it from a "qualified name", also called
a "qname".  These look like

<abc:person name="Andrew" xmlns:abc="http://www.biodas.org/" />

This element has identical meaning to the previous element
using the default namespace.  It's qname is 'abc:person' but
the full name is the tuple of

    ("http://www.biodas.org/", "person")

For notational convenience this is sometimes written in Clark
notation, as
   {http://www.biodas.org}person

   Element                                     Clark notation
<person />                                      person
<person xmlns="" />                             {}person
                            ("empty namespace" is different than "no  
namespace")

<person xmlns="http://biodas.org/" />            
{http://biodas.org/}person
<das:person xmlns:das="http://biodas.org/" />    
{http://biodas.org/}person
<X:person xmlns:X="http://biodas.org/" />        
{http://biodas.org/}person

The prefix used doesn't matter.  Only the combination of
   (namespace, local name)
is important.  The Clark notation string captures that as a single  
string,
which is much easier when doing comparisons.

For example, if you try the dasypus verifier at
    
http://cgi.biodas.org:8080/verify?url=http://das.biopackages.net/das/ 
genome/yeast/S228C/feature?inside=chr1/0:1000&doctype=features

one of the output messages is

Expected element '{http://www.biodas.org/ns/das/genome/2.00}FEATURES'  
but
got '{http://www.biodas.org/ns/das/2.00}FEATURELIST' at byte 113, line  
3, column 2

This shows the Clark name for the elements, indicating that the root
element has a different namespace and local name from what Dasypus  
expects.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 15:15:40 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:15:40 -0800
Subject: [DAS2] xml namespaces
Message-ID: <fee5e61fc425e4257406e81fba814a66@dalkescientific.com>

related to the previous email.  The spec uses the namespace

    http://www.biodas.org/ns/das/genome/2.00

I propose using a smaller and simpler URL.

The content does not matter to XML processors.  The practice though
is to use a URI which is resolvable for more information about the
element. For example,

      xmlns:xlink="http://www.w3.org/1999/xlink"

Go to that and the response is

> This is an XML namespace defined in the XML Linking Language (XLink) 
> specification.
>
> For more information about XML, please refer to The Extensible Markup 
> Language (XML) 1.0 specification. For more information about XML 
> namespaces, please refer to the Namespaces in XML specification.


Similarly the XML namespace URI is
   http://www.w3.org/1999/xhtml
XSLT is
   http://www.w3.org/1999/XSL/Transform

FOAF is
   http://xmlns.com/foaf/0.1/
which points to the actual documentation

I like the last approach and propose that DAS2 use the namespace

      http://biodas.org/documents/das2/


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 15:22:14 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:22:14 -0800
Subject: [DAS2] xml namespaces
In-Reply-To: <fee5e61fc425e4257406e81fba814a66@dalkescientific.com>
References: <fee5e61fc425e4257406e81fba814a66@dalkescientific.com>
Message-ID: <b8e904b97e7bd99c3cecbbc80df85a43@dalkescientific.com>

Me:
> I propose using a smaller and simpler URL.
  ...
> I like the last approach and propose that DAS2 use the namespace
>
>      http://biodas.org/documents/das2/

But it's such a minor point that not changing it is fine with me.

On the other hand, Allen's server doesn't given the right namespace
and Gregg's client currently ignores the namespace, so there isn't
any extra work.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 15:29:56 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:29:56 -0800
Subject: [DAS2] search by segment id
Message-ID: <712b5b29c53161455f3d9d1b34768937@dalkescientific.com>

One thing I came up with yesterday when moving from local identifiers
to URIs for the segment names.  There are two possible identifiers
for a given segment

<das:SEGMENTS xmlns:das="http://www.biodas.org/ns/das/genome/2.00"
        xml:base="http://localhost/das2/">
  <das:SEGMENT uri="segment/chr1" title="Chromosome 1" length="246127941"
    synonym="http://dalkescientific.com/human35v1/chr1"
    doc_href="http://www.ensembl.org/Homo_sapiens/mapview?chr=1" />


The local name is "http://localhost/das2/segment/chr1" while
the well-known global name (of which the local name is an alias) is
"http://dalkescientific.com/human35v1/chr1"

The global name can be anything.  It can be "urn:lsid:chr1" or
anything else.  It only needs to be unique across all identifiers.

Now, are range queries done with the local name or the global one?

That is,   
features?segment=http://localhost/das2/segment/chr1&range=100:200

      or  
features?segment=http://dalkescientific.com/human35v1/chr1&range=100: 
200
    ( or features?segment=urn:lsid:chr1&range=100:200 if that was the  
uri)


If it's the local name then the client must first query all servers
to get the mapping from global name to local name, and perform the
translation itself.

I propose that the client can query using the global name, and not
need to do the mapping to the local name.  In addition, a server
may support both names in the query, since by using URIs we guarantee
there are no accidental id collisions.

					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Wed Mar 15 15:34:06 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Wed, 15 Mar 2006 15:34:06 +0000
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
	<d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
Message-ID: <9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk>

>
> genome.cbs.dtu.dk:9000/das/tmhmm/
> genome.cbs.dtu.dk:9000/das/netoglyc/
> das.ensembl.org/das/ens_sc1_ygpm/
> atgc.lirmm.fr/cgi-bin/das/MethDB/
> smart.embl.de/smart/das/smart/
> supfam.org/SUPERFAMILY/cgi-bin/das/up/
> mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/

all these servers match to the DAS 1 spec which says that the second to 
last bit
is "das" and the last bit is the "data source name".
The registry contains a check for that.

Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From td2 at sanger.ac.uk  Wed Mar 15 15:16:25 2006
From: td2 at sanger.ac.uk (Thomas Down)
Date: Wed, 15 Mar 2006 15:16:25 +0000
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
	<d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
Message-ID: <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk>


On 15 Mar 2006, at 14:46, Andrew Dalke wrote:

> Plus, few of the DAS1 servers follow the DAS1 naming scheme.  Here's
> a list from Andreas' registry server.
>
> genome.cbs.dtu.dk:9000/das/tmhmm/
> genome.cbs.dtu.dk:9000/das/netoglyc/
> das.ensembl.org/das/ens_sc1_ygpm/
> atgc.lirmm.fr/cgi-bin/das/MethDB/
> smart.embl.de/smart/das/smart/
> supfam.org/SUPERFAMILY/cgi-bin/das/up/
> mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/

These all look fine to me -- but they're URLs for individual data  
sources, rather than complete server installations.  Remove the last  
element and you'll get a server URL (e.g. genome.cbs.dtu.dk:9000/ 
das/) which ends /das/ in all cases.

The registry records datasources, not server installations.  In  
general, I'm not sure a server installation is a terribly  
"interesting" object, since it's quite possible that one server  
installation will host many datasources with little or no semantic  
connection between them -- the only thing they have in common is that  
they're hosted at the same site.

          Thomas.


From lstein at cshl.edu  Wed Mar 15 15:41:46 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 15:41:46 +0000
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <4d86b8f899632c8cd506297938fffd8a@dalkescientific.com>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151046.43196.lstein@cshl.edu>
	<4d86b8f899632c8cd506297938fffd8a@dalkescientific.com>
Message-ID: <200603151541.47538.lstein@cshl.edu>

I'll use your hand-edited examples for testing.

Lincoln

On Wednesday 15 March 2006 14:32, Andrew Dalke wrote:
> Lincoln:
> > I just ran through the source request on biopackages.net and it is
> > returning
> > something that is very different from the current spec (CVS updated as
> > of
> > this morning UK time).
>
> The server isn't synched with any specific version of the spec. For
> example, if I make a features request from
>
> http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/
> 0:1000")
>
> I get
>
> <?xml version="1.0" standalone="no"?>
> <!DOCTYPE DAS2FEATURE SYSTEM
> "http://www.biodas.org/dtd/das2feature.dtd">
> <FEATURELIST
>    xmlns="http://www.biodas.org/ns/das/2.00"
>    xmlns:xlink="http://www.w3.org/1999/xlink"
>    xml:base="http://das.biopackages.net/das/genome/yeast/S228C/feature">
> </FEATURELIST>
>
> As from the discussion a few weeks ago we shouldn't be using the
>    standalone="no"
> since that says the document cannot be understood without consulting
> the DTD, which doesn't exist.  And I don't want a DTD.
>
> Also, the namespace needs to be
> "http://www.biodas.org/ns/das/genome/2.00"
> (It's missing the 'genome') and the 'FEATURELIST' was replaced with
> 'FEATURES' a year ago.
>
> In the types request
>
> <?xml version="1.0" standalone="no"?>
> <?xml-stylesheet type="text/xsl" href="/xsl/das-genome-type.xsl"?>
> <!DOCTYPE DAS2TYPES SYSTEM "http://www.biodas.org/dtd/das2types.dtd">
> <!--
>       xmlns="http://www.biodas.org/ns/das/genome/2.00"
> -->
> <TYPES
>       xmlns:xlink="http://www.w3.org/1999/xlink"
>       xml:base="http://das.biopackages.net/das/genome/yeast/S228C/type/">
>    <TYPE id="SO:ARS" ontology="/das/ontology/obo/1/ontology/SO/0000436"
> name="ARS" definition="A sequence that can autonomously replicate, as a
> plasmid, when transformed into a bacterial host.">
>
>
> the commented out namespace declaration needs to there, and the type
> id 'SO:ARS' needs to be escaped as it's treated as an identifier
> resolved
> with the "SO" protocol.  Plus, until yesterday I didn't know about the
> 'name' or 'definition' attributes.  These are now in the schema as
> 'title' and 'description'.
>
> There are a few other differences, like problems in the taxid and
> empty strings for timestamps.  I hand-updated examples from Allen's
> server yesterday, in cvs under das/das2/draft3/ucla .  I found some
> of these during the update, though others I pointed out about a
> year ago.
>
> Allen doesn't want to update the server until the spec is stable,
> for two reasons.  First, he doesn't like the churn of doing work only
> to have to make more changes.  Second, you're not the only one who says
>
> >  It is much more fun for me to code to a working
> > server because I have the opportunity to watch my code run.
>
> and Allen's setup doesn't have the ability to implement two versions
> at the same time.
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From lstein at cshl.edu  Wed Mar 15 15:49:40 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 15:49:40 +0000
Subject: [DAS2] XML namespaces
In-Reply-To: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>
References: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>
Message-ID: <200603151549.41773.lstein@cshl.edu>

I have just finished adding XML namespace support to the early-version Perl 
DAS2 client. BTW, if a namespace tag is reused in an inner scope with a 
different 

	<das:name xmlns:das="http://foo.bar/das" />
		<das:first>Andrew</das:first>
		<das:middle xmlns:das="http://addresses.com/address/2.0">K.</das:middle>
                <das:last>Dalke</das:last>
          </das:name>

I put middle into namespace http://addresses.com/address/2.0 and put first and 
last into namespace http://foo.bar.das.

This is the correct scoping behavior, right?

Lincoln

On Wednesday 15 March 2006 15:05, Andrew Dalke wrote:
> I mentioned this yesterday but am doing it again as its own email.
> This is a quick tutorial on XML namespaces.
>
> The DAS spec uses XML namespaces.  XML didn't start with namespaces.
> They were added later.  Older parsers, like SAX 1.0, did not understand
> namespaces.  Newer ones, like SAX 2.0, do.
>
> By default a document does not have a namespace.  For example,
>
> <person name="Andrew" />
>
> has no namespace.
>
> To declare a default namespace use the 'xmlns' attribute.  All
> attributes which start 'xml' or are in the 'xml:' namespace are
> reserved.
>
> <person name="Andrew" xmlns="http://www.biodas.org/" />
>
> This is the name 'person' in the namespace 'http://www.biodas.org/'.
> The namespace is an opaque identifer.  It leverages URIs in part
> because it's much easier to guarantee uniqueness.
>
> The combination of (namespace, tag name) is unique.  The tag
> name is also called the "local name".
>
> That's to distinguish it from a "qualified name", also called
> a "qname".  These look like
>
> <abc:person name="Andrew" xmlns:abc="http://www.biodas.org/" />
>
> This element has identical meaning to the previous element
> using the default namespace.  It's qname is 'abc:person' but
> the full name is the tuple of
>
>     ("http://www.biodas.org/", "person")
>
> For notational convenience this is sometimes written in Clark
> notation, as
>    {http://www.biodas.org}person
>
>    Element                                     Clark notation
> <person />                                      person
> <person xmlns="" />                             {}person
>                             ("empty namespace" is different than "no
> namespace")
>
> <person xmlns="http://biodas.org/" />
> {http://biodas.org/}person
> <das:person xmlns:das="http://biodas.org/" />
> {http://biodas.org/}person
> <X:person xmlns:X="http://biodas.org/" />
> {http://biodas.org/}person
>
> The prefix used doesn't matter.  Only the combination of
>    (namespace, local name)
> is important.  The Clark notation string captures that as a single
> string,
> which is much easier when doing comparisons.
>
> For example, if you try the dasypus verifier at
>
> http://cgi.biodas.org:8080/verify?url=http://das.biopackages.net/das/
> genome/yeast/S228C/feature?inside=chr1/0:1000&doctype=features
>
> one of the output messages is
>
> Expected element '{http://www.biodas.org/ns/das/genome/2.00}FEATURES'
> but
> got '{http://www.biodas.org/ns/das/2.00}FEATURELIST' at byte 113, line
> 3, column 2
>
> This shows the Clark name for the elements, indicating that the root
> element has a different namespace and local name from what Dasypus
> expects.
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From dalke at dalkescientific.com  Wed Mar 15 15:53:11 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:53:11 -0800
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
	<d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
	<9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk>
Message-ID: <0e5d03e0bc2f9ab791a891f058ca664b@dalkescientific.com>

Andreas (and Thomas)
>> genome.cbs.dtu.dk:9000/das/tmhmm/
>> genome.cbs.dtu.dk:9000/das/netoglyc/
> all these servers match to the DAS 1 spec which says that the second 
> to last bit
> is "das" and the last bit is the "data source name".
> The registry contains a check for that.

Ahh, right.  I misremembered and thought that "/das" had to
be immediately after the hostname.  Looking now there can be
an arbitrary prefix.

What I remembered was the servers at http://das.bcgsc.ca:8080/das
which don't have regular names.

Then again, they have nearly bit-rotted away.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 16:04:38 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 08:04:38 -0800
Subject: [DAS2] XML namespaces
In-Reply-To: <200603151549.41773.lstein@cshl.edu>
References: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>
	<200603151549.41773.lstein@cshl.edu>
Message-ID: <2de39a4a831f6a06c408bdf31ef2a41f@dalkescientific.com>

Linconl:
> BTW, if a namespace tag is reused in an inner scope with a
> different
>
> 	<das:name xmlns:das="http://foo.bar/das" />
> 		<das:first>Andrew</das:first>
> 		<das:middle 
> xmlns:das="http://addresses.com/address/2.0">K.</das:middle>
>                 <das:last>Dalke</das:last>
>           </das:name>
>
> I put middle into namespace http://addresses.com/address/2.0 and put 
> first and
> last into namespace http://foo.bar.das.
>
> This is the correct scoping behavior, right?

Yes.  I tested it with an XML process and it says the following is
equivalent (after fixing a typo).

<ns0:name xmlns:ns0="http://foo.bar/das">
   <ns0:first>Andrew</ns0:first>
   <ns1:middle 
xmlns:ns1="http://addresses.com/address/2.0">K.</ns1:middle>
   <ns0:last>Dalke</ns0:last>
</ns0:name>

BTW, it should be "P." :)

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 15:58:15 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:58:15 -0800
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151039.36405.lstein@cshl.edu>
	<d5fd69cc63e73d1f4419622884b216a1@dalkescientific.com>
	<58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk>
Message-ID: <a3848d3aa72cdf82472f36c3ba8093ab@dalkescientific.com>

Thomas:
> The registry records datasources, not server installations.  In 
> general, I'm not sure a server installation is a terribly 
> "interesting" object, since it's quite possible that one server 
> installation will host many datasources with little or no semantic 
> connection between them -- the only thing they have in common is that 
> they're hosted at the same site.

I agree.

The only thing that's interesting about the server installation is
knowing who is in charge when it goes down.  :)

That's found from the MAINTAINER element at the <SOURCES> level of
the sources document.


					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Wed Mar 15 16:37:51 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Wed, 15 Mar 2006 08:37:51 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day two, 14 Mar 2006
Message-ID: <C03D82DF.1D05A%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day two, 14 Mar 2006

$Id: das2-teleconf-2006-03-14.txt,v 1.1 2006/03/15 16:47:50 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E.
  Sanger: Andreas Prlic, Thomas Down
  Dalke Scientific: Andrew Dalke (at Affy)
  UC Berkeley: Nomi Harris (at Affy)
  UCLA: Allen Day (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


Agenda:
----------
See Andrew's email. Here's a summary.

* segment ids
* coord systems and how to handle

[Gregg is out, Andrew is leading the teleconf.]

ap: ad proposed changes re: coords and capabilities i think is not
really needed. the question is do annotation servers need to provide to
link to reference servers back. If the link is apparent from, c

ad: summary: moving coord element inside capabilities element (one
part of 4 things mentioned). the reason: coords and capabilities are
tied together. They refer to the same thing. E.g., you need know which
of the segments are tied to which coords.

ap: annotation server does need to, it can find the reference server
by the coordinates.

ad: if you have local coords, and you want to point to a local server,
how do you specify that this segment corresponds to these coords.
ap: you should have a reference server that speaks the coords you want
to annotate.
td: if you have your own assembly you have your own coord system,
ad: yes, and i set up my own ref server for it.

ad: if I have mult coords, won't I have multiple segments? isn't there
a 1:1 relationship between coords and segments?
ap: I think many:many.... wait
td: each segment is a member of one coord system, a coord system
contains many segments.
ad: andreas has features, some annotated on scaffold, some annotated
on chromosome. So, you need the ability to have two segments provided
by server.
ap: coords should contain segment capabilities, i.e., the other way
around.

ad: proposing to have a uri to id the coords, capapbility should have
a field to say the coord uri is 'this'
mailed out the idea to have a unique identifier for coords.
keep them separate now, have the ability
sc: optional?
ad: yes only needed if you have mult coord systems.

ad: like features and feature type. segment is saying it's of that
type

ad: will add optional id to the capability, so that you can figure out
what the segments are.

in proposal this am,
1) timestamp to coord info (optional) -- use case: sort by most recent
coord system for a given build.
2) unique id for the coord (

ap: this will be useful for searches as well. can request only results
from a particular coord system. (see email discussion this am)
td: server alignment btwn human and mouse, you can say whether you are
referencing human or mouse just by specifying coord system.
ad: also two different human assemblies.

ap: I have to leave now.

Topic: Segment identifiers email

td: segment had a name and url form id so that feature server doesn't
have to give a concrete url for the seq of chrm22, nice for
lightweight server sans sequence. getting rid of ability to reference
sequence by name instead of url breaks this. You need a concrete url
if you just want to serve features on a sequence.
You end up having to rewrite urls rather than saying this feature is
attached to chr22 in xxx coord system.

ad: one thing gregg and I discussed, the fact that url is by itself an
opaque id, you have to resolve it someway, http, or something else
too. You can use any mechanism you want to turn the name you want.
ad: in segments list, if you have your own local copy. Your segments
section says my local copy is
td: you need a segments capability. I can't have a server that uses
only features capabilities.
ad: if you have your own segments.
if all your features are described using standard names/ids, no you
don't need a segments capability.
td: ok, my assembly is human build 35, and feature lives on chr22.
ad: yes. every place you see optional alias attribute link back to
primary id of segment, that id can be anything.
td: arbitrary string scoped by the coord system, which now has a uri
id string.
ad: yes. and it's also globally unique, not scoped just by coord
system .

td: I don't see what's wrong with ....
ad: we were discussing yesterday having diff names for the same
chromosome. chrI vs chr1.
td: that can be addressed using aliases
ad: alias of field provides a synonym table for what you map locally
to a global id. 
td: you're saying the global ids have to be universally unique even
when taken out of the coord system
ad: yes. feat server providing feats from two diff coord systems, you
need a way to distinguish one segment from another segment, in a
global sense.
td: I don't totally understand cases involving mult coord systems. How
do I find out which of three possible coord systems a given segment
came from?
ad:
td: all clones in embl system. could be a lot.
ad: your client will have to know how to look up the right one.
if you have one coord system that has all your clones, you have to do
the look up anyway to know where to display the features from the
various clones.
td: suppose looking for gene names: you get back a feature on clone
AL19823. I want to start from that feature and build a meaningful
display. So  I need to work out what coord system this feature lives
on. If my server speaks multiple coord systems, one for all embl
accessions and gi ids, I have to test for membership in the set.
My server could put the coord system id on each feature. Would be
optional for servers only attached to one coord system.

ad: right. Andreas also wants coord uri part of feature filter. Could
add it to the feature filter.
td: yes. give me all genes called xyz. Do you always want to limit to
one coord system?
ad: I see your point. Having to search

ad: New thing called title for humans to read.
Also proposed inside, overlaps, contains so they don't

td: to avoid a nastiness in query lang, I like that. Removes an issue
that scares me about having urls in the query. pathological case:
client has a good reason to retrieve features on part of a two
sequences that have lots of features on. e.g., all cutting sites for
all restriction enzymes. Very high density. If the genome is made of
10kb clones, the user may want to get features that span clone
boundaries. server may do lots of extra fetching that's not really
necessary. 
ad: it's the number of requests that's the issue, same amout of
info. so it's an issue of network overhead.

advantage: makes servers easier to implement since it eliminates
searching partial regions. Some use cases exists, but can be done on
the client side. 
td: seems a shame to lose the capability, but not a huge loss.
the alternative would be to say that you parse the query string left
to right. overlaps=5000-10000; ... puts limits on how server parses.
ad: or we propose a new query interface

ad: this sounds like I should go ahead with segment ids.

ad: using uri vs id (internal link id vs link to something else)
td: seems to be enough impl-breaking changes, not a big argument
either way.
ad: enough changes going on now, but probably won't change much more.
td: if you want to make a small change that's quick to implement, no
objections. Also fine with using id, since all dom stuff about id
refers to things marked id in the scheme, not attrib names. Changing
to uri, won't cause much effect.
nh: like a gobal replace.
ad: in general there's been lots of changes, want people to get
clients/servers going.
ad: spec writing is going slow, would like to show examples that
people can use.
nh: feature parsing can use canned examples.
aday: would prefer to have spec written, trouble with ambiguity
ad: you need to impl before you can figure out how to write it.
nh: server people need full spec, client can use examples

ad: previous slow going since lincoln had little time to work on it.
aday: would like a snapshot, version number. impl after last code
sprint.
nh: don't have time to work on das after this. will just break when/if
allen's server changes.
This just happens when working on developing spec.

ad: the idea is to get code and examples up today.
td: waiting for spec to stabilize a bit.
ad: changes made this week won't have major impact on people's work in
UK?
td: no.

nh: can you provide a changes document?
ad: those would be my emails. a pain.

nh: registry, I was suprised to find a versioned sources in it. won't
there be an explosion of org x versions x server. It provides
convenience
td: as long as it's not thousands and thousands of data sources, it
won't be a problem.
ad: 2k per server x 1000 servers, = 2M
td: if it gets to point where retrieving whole registry is a problem,
we could add capability to restrict what you get.
nh: need human-friendly title for each data source.
would be nice if that explained more to the person who was choosing
that data source (e.g., date).
ad: Andreas' system (web-based) has a description.


Status reports
--------------

sc: adding more data to affy das server, working on building
das2_server code recently checked into genoviz code base by
gregg. Then will work on setting it up on a publically accessible
server at affy.

ee: will be working on style sheets in igb.

aday: spent time on setting up dev environment since laptop died
yesterday. 

bo: got food poisoning -- bad pizza?, was up till 4am.

td: not much das-related stuff yet.


From Steve_Chervitz at affymetrix.com  Wed Mar 15 21:24:59 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Wed, 15 Mar 2006 13:24:59 -0800
Subject: [DAS2] New affymetrix das/2 development server
Message-ID: <C03DC62B.1D090%Steve_Chervitz@affymetrix.com>


Gregg's latest spec-compliant, but still development-grade, das/2 server is
now publically available via http://205.217.46.81:9091

It's currently serving annotations from the following assemblies:
- human hg16 
- human hg17 
- drosophila dm2

Send me requests for any other data sources that would help your development
efforts.

Example query to get back a das-source xml document:
http://205.217.46.81:9091/das2/genome/sequence

It's compliance with the spec is steadily improving, on a daily if not
hourly basis during the code sprint.

Within IGB you can access this server from the DAS/2 servers tab
under 'Affy-temp'.

You'll need the latest version of IGB from the CVS repository at
http://sf.net/projects/genoviz

Steve


From dalke at dalkescientific.com  Wed Mar 15 21:25:53 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 13:25:53 -0800
Subject: [DAS2] on local and global ids
Message-ID: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>

The discussion today was on local segment identifiers vs. global
segment identifiers.

I'm going to characterize them as "abstract" vs. "concrete"
identifiers.  An abstract id has no default resolution to a
resource.  A concrete one does.

The identifier "http://www.biodas.org/" is concrete identifier
because it has a default resolver. "lsid:ncbi:human:35" is an
abstract identifier because it has no default resolver (though
there are resolvers for lsid they are not default resolvers.)

The global segment identifier may be a concrete identifier.  It
may implement the segments interface.  But who is in charge of
that?  Who defines and maintains the service?  If it goes down,
(power outage, network cable cut) then what does the rest of
the world do?

For the purposes of DAS it is better (IMO) that the global
identifiers be abstract, though they should be http URLs which
are resolvable to something human readable.  (This is what
the XML namespace elements do.)

Reference servers are concrete identifiers.  They exist.  They
can change (eg, change technologies and change the URLs, say
from cgi-bin/*.pl to an in-process servlet.)  Now, they should
be long-lived, but that's not how life works.

Suppose someone wants to set up an annotation server, without
setting up a reference server.  One solution is to point to
an existing reference server.

<SOURCES>
  <SOURCE>
   <VERSION>
    <CAPABILITY type="segments" 
uri="http://some/remote/reference/server" />
    <CAPABILITY type="features" uri="features.cgi" />
    <CAPABILITY type="types" uri="types.xml" />
   </VERSION>
  </SOURCE>
</SOURCES>

In this case all the features are returned with segments labeled
as in the reference server.  There's no problem.

Second, Andreas wants an abstract "COORDINATE" space id

<SOURCES>
  <SOURCE>
   <VERSION>
    <COORDINATES uri="http://some/arbitrary/coordinate/id" 
authority="NCBI"
       version="35" .... />
    <CAPABILITY type="features" uri="features.cgi" />
    <CAPABILITY type="types" uri="types.xml" />
   </VERSION>
  </SOURCE>
</SOURCES>

This requires a more complicated client because it must have other
information to figure out how to convert from the coordinate identifier
into the corresponding types.

The answer that Andreas and others give is "consult the registry".
That is, look for other other segments CAPABILITY elements with
the same coordinates id.  For that to happen there needs to be a
way to associate a segments doc with a coordinate system.  For example,
this is what the current spec allows (almost - there's no example
of it and I'm still trying to get the schema working for it)

<SOURCES>
  <SOURCE>
   <VERSION>
    <COORDINATES uri="http://some/arbitrary/coordinate/id" 
authority="NCBI"
       version="35" .... />
    <CAPABILITY type="segments" uri="features.cgi"
             coordinates="http://some/arbitrary/coordinate/id" />
   </VERSION>
  </SOURCE>
</SOURCES>


This makes a resolution scheme from an abstract coordinate identifier
into a concrete segments document identifier.

Why are there so many fields on the coordinates?  It could be 
normalized,
so you fetch the coordinate id to get the information.  It's there
to support searches.  A goal has been that the top-level sources 
document
gives you everything you need to know about the system.

(Doesn't mean it's elegant.  I won't talk about alternatives.  It's
not important.  There's at most an extra 150 or so bytes per versioned
source.)


The problem comes when a site wants a local reference server.
These segments have concrete local names.

DAS1 experience suggests that people almost always set up local
servers.  They do not refer to an well-known server.

There are good reasons for doing this.  If the local annotation
server works then the local reference server is almost certain
to work.  The well-known server might not work.

Also, the configuration data is in the sources document.  There's
no need to set up a registry server to resolve coordinates.  There's
no configuration needed in the client to point to the appropriate
concrete identifier given an abstract URL.

My own experience has been that people do not read specifications.
I am an odd-ball.  According to

    http://diveintomark.org/archives/2004/08/16/specs

I am an asshole.  That's okay -- most people are morons.

> Morons, on the other hand, don?t read specs until someone yells at 
> them. Instead, they take a few examples that they find ?in the wild? 
> and write code that seems to work based on their limited sample. Soon 
> after they ship, they inevitably get yelled at because their product 
> is nowhere near conforming to the part of the spec that someone else 
> happens to be using. Someone points them to the sentence in the spec 
> that clearly spells out how horribly broken their software is, and 
> they fix it.

Someone who wants to implement a DAS reference server will
take the data from somewhere and make up a local naming scheme.

That's what happened with DAS1.  That's why Gregg was saying
he maintains a synonym table saying human
    1 = chr1 = Chromo1 = ChrI
    2 = chr2 = Chromo2 = ChrII

This will not change.  People will write a server for local data
and point a DAS client at it.  The client had better just work
for the simple case of viewing the data even through there is
no coordinate system -- it needs to, because people will work on
systems with no coordinate system.

Sites will even write multiple in-house DAS servers providing
data, which work because everything refers to the same in-house
reference server.

It's only the first time that someone wants to merge in-house
data with external data that there's a problem.  This might be
several months after setting up the server.  At that point they
do NOT want to rewrite all the in-house servers to switch to
a new naming scheme.

That's why the primary key for a paired annotation server and
feature must be a local name.  That's what morons will use.
Few will consult some global registry to make things interoperable
at the start.

> For example, some people posit the existence of what I will call the 
> ?angel? developer. ?Angels? read specs closely, write code, and then 
> thoroughly test it against the accompanying test suite before shipping 
> their product. Angels do not actually exist, but they are a useful 
> fiction to make spec writers to feel better about themselves.

Lincoln could come up with universal names for every coordinate
system that ever existed or will exist.  But people will not
consult it.

However, they will when there is a need to do that.  The need comes
in when they want to import external data.  At that point they need
a way to join between two different data sources.

They consult the spec and see that there's a "synonym" (or "reference",
or "global", or "master" or *whatever* name -- I went with synonym
because it doesn't imply that it's the better name.)


  <SEGMENT uri="segment/chrI" title="Chromosome I" length="230209"
      synonym="http://dalkescientific.com/yeast1/ChrI" />

The local name <xml:base> + "segment/ChrI" is also known as
http://dalkescientific.com/yeast1/ChrI .  Simple, and requires
very little change in the server code.

The only other change is to support the synonym name when
doing segment requests, as
   segment=http://dalkescientific.com/yeast1/ChrI

This is important because then clients can make range requests
from servers without having to download the segment document first.
It's also easy to implement, because it's a lookup table in the
web server interface, and not something which needs to be in
the database proper.

Most people are morons.  The spec as-is is written for that.
It's not written for angels.  It allows post-facto patch-ups once
people realize they need a globally recognized name.

It does require smarter clients.  They need to map from local
name to global name, through a translation table provided by
the server.  This is fast and easy to implement.  It's easier
to implement than consulting multiple registry servers and
trying to figure out which is appropriate.

And the XML returned will be smaller.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Wed Mar 15 22:39:36 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 14:39:36 -0800
Subject: [DAS2] xml namespace uri
Message-ID: <f5afcfa23b8826ee042fa81e5c4bc57a@dalkescientific.com>

Please use

   "http://biodas.org/documents/das2"

for the XML element namespace.

The two current servers (Allen's and Steve's) use

   "http://www.biodas.org/ns/das/2.00"

which is wrong according to the spec, for the last 2 years it's been

   "http://www.biodas.org/ns/das/genome/2.00"

Since the servers need to change anyway, might as well make it
something a bit more readable, and shorter.  :)

I've checked all the current dasypus (validator) software into
CVS, btw, and updated all of the example xml (draft3/ucla/) to
use the new namespace.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 05:17:24 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 21:17:24 -0800
Subject: [DAS2] query language description
Message-ID: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>

The query fields are

   name      |  takes | matches features ...
  ==========================
   xid       |  URI   | which have the given xid
   type      |  URI   | with the given type or subtype (XX keep this  
one???)
   exacttype |  URI   | with exactly the given type
   segment   |  URI   | on the given segment
   overlaps  | region | which overlap the given region
   inside    | region | which are contained inside the given region (XX  
needed??)
   contains  | region | which contain the given region  (XX needed?? )
   name      | string | with a name or alias which matches the given  
string
   prop-*    | string | with the property "*" matching the given string

Queries are form-urlencoded requests.  For example, if the features
query URL is 'http://biodas.org/features' and there is a segment named
'http://ncbi.org/human/Chr1' then the following is a request for all the
features on the first 10,000 bases of that segment

The query is for
     segment = 'http://ncbi.org/human/Chr1'
     overlaps = 0:10000

which is form-urlencoded as

    
http://biodas.org/features? 
segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A1000

Multiple search terms with the same key are OR'ed together.  The  
following
searches for features containing the name or alias of either
BC048328 or BC015400

   http://biodas.org/features?name=BC048328;name=BC015400

Multiple search terms with different keys are AND'ed together,
but only after doing the OR search for each set of search terms with
identical keys.  The following searches for features which have
a name or alias of BC048328 or BC015400 and which are on the segment
http://ncbi.org/human/Chr1

    
http://biodas.org/features?name=BC048328; 
segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400

The order of the search terms in the query string does not affect
the results.

If any part of a complex feature (that is, one with parents
or parts) matches a search term then all of the parents and
parts are returned.  (XXX Gregg -- is this correct? XXX)


The fields which take URLs require exact matches.

I think we decided that there is no type inferencing done in
the server; it's a client side thing.  In that case the 'type'
field goes away.  We can still keep 'exacttype'.  The URI
used for the matching is the type uri, and NOT the ontology URI.

(We don't have an ontology URI yet, and when we do we can add
an 'ontology' query.)

The segment URI must accept the local identifier.  For
interoperability with other servers they must also accept the
equivalent global identifier, if there is one.

If range searches are given then one and only one segment is
allowed.  Multiple segments may be given, but then ranges are not
allowed.

The string searches support a simple search language.
     ABC  -- contains a word which exactly matches "ABC" (identity, not  
substring)
    *ABC  -- words ending in "ABC"
     ABC* -- words starting with "ABC"
    *ABC* -- words containing the substring "ABC"

If you want a field which exactly contains a '*' you're kinda
out of luck.  The interpretation of whitespace in the query or
in the search string is implementation dependent.  For that
matter, the meaning of "word" is implementation dependent.  (Is
*O'Malley* one word? *Lethbridge-Stewart*?)

When we looked into this last month at Sanger we verified that
all the databases could handle %substring% searches, which was
all that people there wanted.  The Affy people want searches for
exact word, prefix and suffix matches, as supported by the the
back-end databases.


   XXX CORRECT ME XXX

The 'name' search searches.... It used to search the 'name'
attribute and the 'alias' fields.  There is no 'name' now.  I
moved it to 'title'.  I think I did the wrong thing; it should
be 'name', but it's a name meant for people, not computers.

Some features (sub-parts) don't have human-readable names so
this field must be optional.


The "prop-*" is a search of the <PROP> elements.  Features may
have properties, like

    <PROP key="cellular_component" value="membrane" />

To do a string search for all 'membrane' cellular components,
construct the query key by taking  the string "prop-" and
appending the property key text ("cellular_component").  The
query value is the text to search for.

     prop-cellular_component=membrane

To search for any cellular_component containing the substring "mem"

     prop-cellular_component=*membrane*

The rules for multiple searches with the same key also apply to the
prop-* searches.  To search for all 'membrane' or 'nuclear'
cellular components, use two 'prop-cellular_component' terms, as

      
http://biodas.org/features?prop-cellular_component=membrane;prop- 
cellular_component=membrane


The range searches are defined with explicit start and end
coordinates.  The range syntax is in the form "start:end", for
example, "1:9".

Let 'min' be the smallest coordinate for a feature on a given
segment and 'max' be one larger than the largest coordinate.
These are the lower and upper founds for the feature.

An 'overlaps' search matches if and only if
    min < end AND max > start

XXX For GREG XXX

What do 'inside' and 'contains' do?  Can't we just get
away with 'excludes', which has complement of 'overlaps'?
Searches are done as:
   Step 0) specify the segment
   Step 1) do all the includes  (if none, match all features on segment)
   Step 2) do all the excludes, inverted (like an includes search)
   Step 3) only return features which are in Step 1 but not
       in Step 2)
   Step 4) ...
   Step 5) Profit!

I think this will support your smart code, and it's easy
enough to implement.

Every one but you was planning to use 'overlaps'.  Only you
wanted to use 'inside'.  Anyone want to use 'contains'?

					Andrew
					dalke at dalkescientific.com


From td2 at sanger.ac.uk  Thu Mar 16 09:24:03 2006
From: td2 at sanger.ac.uk (Thomas Down)
Date: Thu, 16 Mar 2006 09:24:03 +0000
Subject: [DAS2] on local and global ids
In-Reply-To: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
Message-ID: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>


On 15 Mar 2006, at 21:25, Andrew Dalke wrote:
>
> The problem comes when a site wants a local reference server.
> These segments have concrete local names.
>
> DAS1 experience suggests that people almost always set up local
> servers.  They do not refer to an well-known server.

I'm not sure that DAS1 experience is a good model for this.  It's  
true that people didn't always point to well-known reference servers,  
but I think this has more to do with the fact that people didn't know  
which server to point to.  Some people did set up their own reference  
servers.  Many didn't, and many of those didn't give a valid  
MAPMASTER URL at all.  This situation didn't actually cause too much  
trouble since a lot of these users just wanted to add a track to  
Ensembl -- which doesn't care about MAPMASTER URLs and just trusts  
the user to add tracks that live in an appropriate coordinate system.

I'd still argue that the majority -- probably the vast majority -- of  
people setting up DAS servers really just want to make an assertion  
like "I'm annotating build NCBI35 of the human genome" and be done  
with it.  That's what the coordinate system stuff in DAS/2 is for.   
If this is documented properly I don't think we'll see many "end- 
user" sites setting up their own reference servers unless a) they  
want an internal mirror of a well-known server purely for performance/ 
bandwidth reasons or b) they want to annotate an unpublished/new/ 
whatever genome assembly.

(Actually, some of the "annotation providers set up their own  
reference servers" stuff might be my fault -- early versions of  
Dazzle were pretty strict about requiring a valid [and functional!]  
MAPMASTER for every datasource, so this pushed people towards setting  
up reference servers.)

                 Thomas.


From lstein at cshl.edu  Thu Mar 16 11:03:49 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu, 16 Mar 2006 11:03:49 +0000
Subject: [DAS2] on local and global ids
In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
Message-ID: <200603161103.50323.lstein@cshl.edu>

I think it will help considerably to have a document that lists the valid 
sequence IDs for popular annotation targets. I've spoken with Ewan on this, 
and Ensembl will generate a list of IDs for all vertebrate builds. I'll take 
responsibility for creating IDs for budding yeast, two nematodes and 12 
flies.

Lincoln

On Thursday 16 March 2006 09:24, Thomas Down wrote:
> On 15 Mar 2006, at 21:25, Andrew Dalke wrote:
> > The problem comes when a site wants a local reference server.
> > These segments have concrete local names.
> >
> > DAS1 experience suggests that people almost always set up local
> > servers.  They do not refer to an well-known server.
>
> I'm not sure that DAS1 experience is a good model for this.  It's
> true that people didn't always point to well-known reference servers,
> but I think this has more to do with the fact that people didn't know
> which server to point to.  Some people did set up their own reference
> servers.  Many didn't, and many of those didn't give a valid
> MAPMASTER URL at all.  This situation didn't actually cause too much
> trouble since a lot of these users just wanted to add a track to
> Ensembl -- which doesn't care about MAPMASTER URLs and just trusts
> the user to add tracks that live in an appropriate coordinate system.
>
> I'd still argue that the majority -- probably the vast majority -- of
> people setting up DAS servers really just want to make an assertion
> like "I'm annotating build NCBI35 of the human genome" and be done
> with it.  That's what the coordinate system stuff in DAS/2 is for.
> If this is documented properly I don't think we'll see many "end-
> user" sites setting up their own reference servers unless a) they
> want an internal mirror of a well-known server purely for performance/
> bandwidth reasons or b) they want to annotate an unpublished/new/
> whatever genome assembly.
>
> (Actually, some of the "annotation providers set up their own
> reference servers" stuff might be my fault -- early versions of
> Dazzle were pretty strict about requiring a valid [and functional!]
> MAPMASTER for every datasource, so this pushed people towards setting
> up reference servers.)
>
>                  Thomas.
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From lstein at cshl.edu  Thu Mar 16 11:06:38 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu, 16 Mar 2006 11:06:38 +0000
Subject: [DAS2] Spec freeze
In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
Message-ID: <200603161106.39074.lstein@cshl.edu>

Hi,

I just spoke with Thomas and Andreas on this, and all three of us are experiencing difficulty coding to a changing spec. In my opinion the spec is really good right now and issues such as whether to use "uri" or "id" as attribute names are not germaine. Can I propose that we declare a three month spec freeze starting at midnight tonight (GMT)?

Lincoln


-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From dalke at dalkescientific.com  Thu Mar 16 15:38:00 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 07:38:00 -0800
Subject: [DAS2] on local and global ids
In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
Message-ID: <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>

Thomas:
> I'm not sure that DAS1 experience is a good model for this.  It's true 
> that people didn't always point to well-known reference servers, but I 
> think this has more to do with the fact that people didn't know which 
> server to point to.

I think I said there are two cases; there's actually several

  1. the sources document states a well-known COORDINATES
       and makes no links to segments
  2. the sources document refers to a well-known segments server
       ("the" reference server) and no COORDINATES
  3. the source document has a segments document, and each segment
      listed uses URIs from "the" reference server
  4. the server implements its own coordinates server, with
      new segment ids
  5. When uploading a track to Ensembl there's no need to have
      either COORDINATE or segments -- the upload server can
      verify for itself that the upload uses the right ids.


The *only* concern is with #4.  Everything else uses the well-known
global identifier for segments.

> I'd still argue that the majority -- probably the vast majority -- of 
> people setting up DAS servers really just want to make an assertion 
> like "I'm annotating build NCBI35 of the human genome" and be done 
> with it.

I'm fine with that.  There are two ways to do it.  #1 and #2 above.
In theory only one of those is needed.   The document can point to
"the" reference server for NCBI 35.

In practice that's not sufficient because there is no authoritative
NCBI 35 server.

Hence COORDINATES provides an abstract global identifier describing
the reference server.

>   That's what the coordinate system stuff in DAS/2 is for.  If this is 
> documented properly I don't think we'll see many "end-user" sites 
> setting up their own reference servers unless a) they want an internal 
> mirror of a well-known server purely for performance/bandwidth reasons 
> or b) they want to annotate an unpublished/new/whatever genome 
> assembly.

A philosophical comment.  I'm a distributed, self-organizing kinda
guy.  I don't think single root centralized systems work well when
there are many different groups involved.

I think many people will use the registry server, but not all.
I think there will be public DAS servers which aren't in the registry.
I know there will be in-house DAS servers which aren't.

I'm just about certain that some sites will have local copies of
the primary data.  They do for GenBank, for PDB, for SWISS-PROT,
for EnsEMBL.  Why not for DAS?

That said, here's a couple of questions for you to answer:

   a) When connecting to a new versioned source containing only
COORDINATES data, what should the client do to get the list
of segments, sizes, and primary sequence?

I can think of several answers.  My answer is that the versioned
source should state the preferred reference server and unless
otherwise configured a client should use that reference server
and only that reference server.

Yes, all the reference servers for that coordinate system
are supposed to return the same results.  But that's only if
they are available.  There are performance issues too, like
low bandwidth or hosting the server on a slow machine.  The
DAS client shouldn't round-robin through the list until it
finds one which works because that could take several minutes
to timeout on a single server, with another 10 to try.

Yes, a client can be configured and told "for coordinate
system A use reference server Z".  But that's a user
configuration.

   b) If there is a local mirror of some reference server, how
should the local DAS clients be made aware of it?  (And
should this be a supportable configuration? I think so.)

I'm pretty sure that most DAS clients won't be configurable
to look for local servers instead of global ones.  Even if
they are, I'm pretty sure each will have a different way
to do so.  Apollo and Bioperl will use different mechanisms.

I have no good answer for this.  It sounds like your answer
is "people won't have local copies."  I think they will.

Ideas:
   - have a rewriting registry server which does a rewrite of
the information from the other servers.  But this doesn't
work because the feature result from the remote server (in
my scheme) is given using its local segment names.  There's
no way to go from that local name to the appropriate mirror
reference server.  This suggests that the results really do
need to be given through global ids, with no support for
local ones.  The segments result optionally provides a way
to resolve a global name through a local resource.

   - set up an HTTP proxy service for DAS requests which
transparently detects, translates and redirects to the
appropriate local resource.  Cute, but not likely to be
done in real life.

   c) A group has been working on a new genome/assembly.  The
data is annotated on local machines using DAS and DAS writeback
Finally it's published.  Do they need to rewrite all their
segment identifiers to use the newly defined global ones?

As there are only a few places where the segment identifier is
used, and it's an interface layer, I think the conversion is
easy.  But it is a flag day event which means people don't
want to do it.  Instead, it's more likely that local people
will set up a synonym table to help with the conversion.

There are perhaps a dozen groups which might do this and they
all have competent people.  This should not be a problem.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 16:06:26 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 08:06:26 -0800
Subject: [DAS2] on local and global ids
In-Reply-To: <200603161103.50323.lstein@cshl.edu>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
	<200603161103.50323.lstein@cshl.edu>
Message-ID: <f35df2e5087a0027594a6dd2cd1d9a28@dalkescientific.com>

Lincoln:
> I think it will help considerably to have a document that lists the 
> valid
> sequence IDs for popular annotation targets. I've spoken with Ewan on 
> this,
> and Ensembl will generate a list of IDs for all vertebrate builds. 
> I'll take
> responsibility for creating IDs for budding yeast, two nematodes and 12
> flies.

What should people use if there aren't defined?  Like now?

If everyone must use the same well-defined global id for the features
response then doesn't that mean we can't have any DAS servers
until this document is made?

Is the general requirement that the first person to make a server for
a given build/genome/etc. is the one who gets to define the
global ids?  Or is it Andreas at Sanger who defines the names?

Suppose one group in California starts defining names for, say,
the barley genome.  Another group in say, Germany, is also working
on the barley genome.  They hate each others guts and don't work
together, so they make their own names.  The names refer to the
same thing because it was a group in Japan which produced the
genome.  Do we wait for an alignment service?  An identity service?
before people can merge data from these two groups?

Maybe we can solve all this by having an identity mapper format.
And defer defining that format until there is a problem.

There is no perfect solution.  This is a sociological problem.

Gregg's current client, I think, used hard-coded knowledge about the
mapping between the two current servers.  Then again, his code
already supports a synonym table.


					Andrew
					dalke at dalkescientific.com


From gilmanb at pantherinformatics.com  Thu Mar 16 15:52:51 2006
From: gilmanb at pantherinformatics.com (Brian Gilman)
Date: Thu, 16 Mar 2006 10:52:51 -0500
Subject: [DAS2] on local and global ids
In-Reply-To: <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
	<41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>
Message-ID: <441989D3.90202@pantherinformatics.com>

Hey Guys,

    Where's the latest spec and use case document? Sorry if this is a 
super dumb question. I couldn't find it on the website.

                               Best,

                                        -B

Andrew Dalke wrote:

>Thomas:
>  
>
>>I'm not sure that DAS1 experience is a good model for this.  It's true 
>>that people didn't always point to well-known reference servers, but I 
>>think this has more to do with the fact that people didn't know which 
>>server to point to.
>>    
>>
>
>I think I said there are two cases; there's actually several
>
>  1. the sources document states a well-known COORDINATES
>       and makes no links to segments
>  2. the sources document refers to a well-known segments server
>       ("the" reference server) and no COORDINATES
>  3. the source document has a segments document, and each segment
>      listed uses URIs from "the" reference server
>  4. the server implements its own coordinates server, with
>      new segment ids
>  5. When uploading a track to Ensembl there's no need to have
>      either COORDINATE or segments -- the upload server can
>      verify for itself that the upload uses the right ids.
>
>
>The *only* concern is with #4.  Everything else uses the well-known
>global identifier for segments.
>
>  
>
>>I'd still argue that the majority -- probably the vast majority -- of 
>>people setting up DAS servers really just want to make an assertion 
>>like "I'm annotating build NCBI35 of the human genome" and be done 
>>with it.
>>    
>>
>
>I'm fine with that.  There are two ways to do it.  #1 and #2 above.
>In theory only one of those is needed.   The document can point to
>"the" reference server for NCBI 35.
>
>In practice that's not sufficient because there is no authoritative
>NCBI 35 server.
>
>Hence COORDINATES provides an abstract global identifier describing
>the reference server.
>
>  
>
>>  That's what the coordinate system stuff in DAS/2 is for.  If this is 
>>documented properly I don't think we'll see many "end-user" sites 
>>setting up their own reference servers unless a) they want an internal 
>>mirror of a well-known server purely for performance/bandwidth reasons 
>>or b) they want to annotate an unpublished/new/whatever genome 
>>assembly.
>>    
>>
>
>A philosophical comment.  I'm a distributed, self-organizing kinda
>guy.  I don't think single root centralized systems work well when
>there are many different groups involved.
>
>I think many people will use the registry server, but not all.
>I think there will be public DAS servers which aren't in the registry.
>I know there will be in-house DAS servers which aren't.
>
>I'm just about certain that some sites will have local copies of
>the primary data.  They do for GenBank, for PDB, for SWISS-PROT,
>for EnsEMBL.  Why not for DAS?
>
>That said, here's a couple of questions for you to answer:
>
>   a) When connecting to a new versioned source containing only
>COORDINATES data, what should the client do to get the list
>of segments, sizes, and primary sequence?
>
>I can think of several answers.  My answer is that the versioned
>source should state the preferred reference server and unless
>otherwise configured a client should use that reference server
>and only that reference server.
>
>Yes, all the reference servers for that coordinate system
>are supposed to return the same results.  But that's only if
>they are available.  There are performance issues too, like
>low bandwidth or hosting the server on a slow machine.  The
>DAS client shouldn't round-robin through the list until it
>finds one which works because that could take several minutes
>to timeout on a single server, with another 10 to try.
>
>Yes, a client can be configured and told "for coordinate
>system A use reference server Z".  But that's a user
>configuration.
>
>   b) If there is a local mirror of some reference server, how
>should the local DAS clients be made aware of it?  (And
>should this be a supportable configuration? I think so.)
>
>I'm pretty sure that most DAS clients won't be configurable
>to look for local servers instead of global ones.  Even if
>they are, I'm pretty sure each will have a different way
>to do so.  Apollo and Bioperl will use different mechanisms.
>
>I have no good answer for this.  It sounds like your answer
>is "people won't have local copies."  I think they will.
>
>Ideas:
>   - have a rewriting registry server which does a rewrite of
>the information from the other servers.  But this doesn't
>work because the feature result from the remote server (in
>my scheme) is given using its local segment names.  There's
>no way to go from that local name to the appropriate mirror
>reference server.  This suggests that the results really do
>need to be given through global ids, with no support for
>local ones.  The segments result optionally provides a way
>to resolve a global name through a local resource.
>
>   - set up an HTTP proxy service for DAS requests which
>transparently detects, translates and redirects to the
>appropriate local resource.  Cute, but not likely to be
>done in real life.
>
>   c) A group has been working on a new genome/assembly.  The
>data is annotated on local machines using DAS and DAS writeback
>Finally it's published.  Do they need to rewrite all their
>segment identifiers to use the newly defined global ones?
>
>As there are only a few places where the segment identifier is
>used, and it's an interface layer, I think the conversion is
>easy.  But it is a flag day event which means people don't
>want to do it.  Instead, it's more likely that local people
>will set up a synonym table to help with the conversion.
>
>There are perhaps a dozen groups which might do this and they
>all have competent people.  This should not be a problem.
>
>					Andrew
>					dalke at dalkescientific.com
>
>_______________________________________________
>DAS2 mailing list
>DAS2 at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/das2
>
>
>  
>


From dalke at dalkescientific.com  Thu Mar 16 16:33:58 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 08:33:58 -0800
Subject: [DAS2] on local and global ids
In-Reply-To: <441989D3.90202@pantherinformatics.com>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>	<9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
	<41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>
	<441989D3.90202@pantherinformatics.com>
Message-ID: <24b985c0229970562a9e2612f00f2da5@dalkescientific.com>

Brian:
>    Where's the latest spec and use case document? Sorry if this is a 
> super dumb question. I couldn't find it on the website.

CVS for the spec.  The history is:

draft 1 - written by Lincoln, freeze for summer last year.
This is the one with HTML, etc. and is on the web site.

draft 2 - written by me in January.  In CVS under das/das2/new_spec.txt
with examples under das/das2/scratch . This was the version
for the spring last month

draft 3 - under development
I rewrote beginning of it because no one liked the pedantic
pedagogical style it used.  This draft starts with examples.
The incomplete version, as of Monday morning, is 
das/das2/draft3/spec.txt
However, I am slow at writing spec text, especially new text.
Instead of working on it more I put example output files in
  das/das2/draft3/ucla/
starting with 'sources.xml' in that directory.

As for use cases, the email you saw from me a couple of days
ago is the only thing even close to formal.

					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Thu Mar 16 17:05:10 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu, 16 Mar 2006 17:05:10 +0000
Subject: [DAS2] sources responses
Message-ID: <355af8b441fefe8690a9e78de55fc2f9@sanger.ac.uk>

Hi!

the (toy) sources responses at

http://www.spice-3d.org/dasregistry/das1/sources/
http://www.spice-3d.org/dasregistry/das2/sources/

now are updated to the latest spec and validate with Andrew's validator 
at
http://cgi.biodas.org:8080/

Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From Steve_Chervitz at affymetrix.com  Thu Mar 16 20:37:16 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Thu, 16 Mar 2006 12:37:16 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day three, 15 Mar 2006
Message-ID: <C03F0C7C.1D0F7%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day three, 15 Mar 2006

$Id: das2-teleconf-2006-03-15.txt,v 1.1 2006/03/16 20:45:35 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Thomas Down, Andreas Prlic
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


[Notetaker: joining 10 min into the discussion]

ls: how does synonym business work?
ad: if server has access to data...
ls: we ask server for the global id, uses same global id for segments,
and uses same global id for the sequence.
gh: to do this in the capabilities for annot server, the global id for
segments query points to reference server.
ls: if the local machine current server, has sequence capabilities,
then it passes global id for segments to current server and it gets
the sequence. if it doesn't have that capability, then we need to
figure out a way for it to get the sequence. the easiest way to do
that would be to resolve that url and fetch it. I'm open to any
suggestion. I don't see how this uri/synonym is getting us any closer
to being able to find the server where sequence can be fetched. The
synonym isn't always a fetchable thing.
ad: syn is a global id
ad: look at the uri for the segment and fetch it from there
ls: could be a remote url.
gh: segments query is only thing that gives segment url
segments capabilities for the annot server should point

ls: break apart segments into: id=a string, then have an attribute
seq_url, when fetched returns the seq. returns the bases.
ad: is that's what's there already?
ls: no, uri is an id
ad: every url is an id, but it's up to whim of the server
ls: i don't want people to think its for an id.
want an agreed upon uri identifier, then optionally have a url.
turn synonym into uri, turn uri into resolver
make uri required, bases not required.
ad: additional constraint is 'agreed upon'. what about a group
starts a new sequencing project. There is no globally known uri for it
yet.
ls: they just create their own ids
td: the natural authority is the creator of the assembly.
gh: ncbi won't do it. they don't have a das server, unlikely to.
ls: can point to genome assembly. can create a url that will return
bases from ncbi in a supported format.
this approach will disentangle issue of resolvable vs non-resolvable,
local vs non-local segment ids and how to get segment dna.
gh: I think this will work.

ad: 'this' changing key names?
ls: key semantics
uri is required, global identifier sequence is an optional pointer
gh: you say that for feat xml, the id for seq will be the globally
agreed on id.
ls: yes
ad: if you don't have a local copy, if you have ability to map global
identifiers, then you know what it is from the coordinates.
there are two ways to specificy coordinates: coordinates and segments

ad: if you just need the segments and some identifier.
only when you need to do an overlay with someone else that you
need the coords.
gh: no, coords don't say anything about ids of coord (?)

gh: if we do it the way lincoln proposed, then the logical way to
relate those is that the segments capapbilities points to ref server.
ad: when feat returns a location is it in global or local space?
gh: lincoln - global space

ls: every annot server will know length of its landmarks (chrms).
some people will not want to be served dna, they will point somewhere
else where to get the dna. There will be many places to get dna for a
given global id, they chose one they like.
ls: feature locations are given in global id
ad: this changes the way it's been working. xml:base issues
ls: I know.
gh: if base of sequence and base of features are different, the xml
will get bigger.

ls: so an argument for having local ids is so you can make location
string shorter.
gh: yes.
ls: probably not worth it
ad: also makes it easier to set up a basic server. if you want to
overlay them, yes you do.
ls: you can always set up a local server if you

gh: segments response local and global id as we talked about yesterday
(which one feature locatn is relative to)
gh: if the only way to overlay for a client to know things are in the
same coord system is segid=xxxx and globalid=yyyy, how much harder is
it for server to use global ids.
ls: server can have configuration file to know where its global ids
are coming from

aday: would need to think about it more.
ad: who will set up these identifiers (yeast, human)
ls: I'll do it for model org databases, I will specify segments, and
their dna fetchers and will look up their lengths.
gh: versions?
ls: most recent. community can then keep it up to date.
I bet ensembl will be happy to generate this file automatically with
every build (for vertebrates)

ad: local id uri, and a bunch of synonyms. People will set up own
server not referencing a global system.
ls: then client would do a closure over all systems.
imagine three servers:
server-a says here is my segment
server-b says it can be  b or c
server-c says it can be c or a
so you have to do a join over all servers

gh: not encourage people to do that with local seq ids, encourage
people to use.
need a global referencing system to say this uri is same as that uri.
ad: bad logic for the web. If one is wrong, could be a problem
td: (proposal - based on genomic coord alignments)
ad: that says only alignable things are the same.

ad: don't think it will work, they will already have local servers

gh: what about 'the stick': people who want to register their server
with central registry can only do so if they use global ids for their
segments. 
ls, td: fine
ad: if they've been working for a while in house, they would have a
big effort to retrofit their system to comply. just won't do.

ls: in draft 3, where's assembly info?
ad: same as before. ask segments for agp format. draft not complete.
gh: the thing that ids which assembly you're on is the coordinates
element (authority, taxonomy, ...)
ls: authority is a recognized, globally unique organization. Should it
be a uri?
ad: authority and version is human visible so people can search by
it.
ls: fine.

gh: can invoke the 'stick' idea here: if you 're trying to register
something on same genomome assembly, then registry can check your
segments to verify they are agreed up.
ls: taxon, source, authority, version all must match
ad: also an id
ap: we discussed in email
ad: the only stuff that is complete is in the ucla subdir.
ls: the examples are definitive
ad: yes, unless we change things today.

ls: what if taxon, source, version match but uri doesn't?
registry gets submission. makes a segments request on submitter, if it
gets a list of same segment identifiers, it accepts it. what if it
gets a subset?
gh: ok
ls: superset is not ok.
aday: why?
gh: if you allow subset and superset, you can have everything.
aday: use case: bacteria with extra plasmid identifier.

nh: signing off. will be at affy tomorrow.

ls: you would have to create your own coord system.
gh: could argue with maintainer to added it.
ls: can you have multiple coordinates in a given assembly?
aday: proposal: make coords an attribute of the segment.
could keep your segment references local.

ls: we shouldn't give people ways to create new names. human chr1 ncbi
build 35 should be something that everybody can agree on.
gh: then we wouldn't allow allen's use case where someone wants a
superset of what's in reference?
ls: add new coord tag to source version entry, says I'm creating a
superset consisting of coords from ref 1, 2, 3, any of these can be a
new namespace that I set up.
gh: how do you know which ones come from where?
right now there's now way to get coord for a segment.
ad: can as of yesterday afternoon.

ls: to indicate which segments come from which auth. put coord id into
segments tag. 
aday: thank you!
ad: alternative proposal - multiple segments
use case: when you have scaffolds or chromosomes, or mouse and yeast
ls: say you want human mouse scaffolds + chrms, and human chrms
three diff coords tags in the sources document
each one gives auth, taxon, etc.
when client goes to get segments, it will get human chromosomes, mouse
chrms, and mouse scaffolds, in one big list, each will point back to
coord it got in features requets.

gh: knowing what coordinates doesn't tell you global id for segment
aday: ok.
gh: multiple segments elements vs mult coords in a segment work for
me.
ad: what does a client do
gh: ...
ls: three types of entry points, hu chrms, mo chrms, mo scaffolds, now
tell me what you want to start browsing. human readable.
scaffold on mouse with name xxx from two

ad: displaying all together vs one or the other or the other.

ee: affymetrix use case in igb. [probe

gh: doesn't seem to matter
aday: the tag values are easier to implement
td: not a big difference to me
gh: drawing on whiteboard...

ls: let's rename das to distributed annotation research network. then
we can say "darn1, darn2"!

ad: gregg's request for search to find everything identical (start and
end are same)
td: if you have contained and inside, you can do identical with an and
operation.
ls: doesn't make server any more complicated, for completeness you may
want to do that.
ad: how about includes 1-5000 and excludes ... some of this is asethetic.
ls: overlaps, contains, contained-in have good use cases for.
exact match - maybe searching for curated exons that exactly match
predicted. 

[Lincoln has to leave.]

gh: drawing options for segments and coordinate systems.
[whether you  put a coords tag per segment, or ome capabilities one
for each coord system]
allen's approach - one query with filter or multiple fetches

aday: uniprot example
gh: separate segments query.
ap: can we leave it out and add later if necessary?
ad: these are things that haven't been discussed in last two years
aday: uri

ad: xml namespace issue - what do we call it (see email)
gh: you pick it

ad: required syntax for entry points /das/source
gh: recommended, but not required
ad: lincoln was only one who felt strongly about it being required,
and he's not here.


gh: feature xml, every feature can have multiple locations
feaures can represent alignments (collapsed alignment tag into feature
tag)
td: like it
gh: naive user- given a feat with multip location on genome, represent as
multip locations, or parent child relations
td: don't see as a problem. using parent-child you have things to say
about child features specific to them
gh: genscan prediction,
a problem: one server can serve them up as parent child or as multiple
locations on parent

four child exons in one case
four diff locations in other case

problem is with feat filters. if yo do an overlaps query and any
children meet the condition, you have to return the parent as well and
it's parent on up. agreed?
ad: yes
gh: works fine for parent child, but for multip location situation, if
inside query fully contains only two eons, do you return parent?

td: I'd assume inside query would return both. as long as one exon is
inside the region, the parent is return. define inside as applying to
any level.
gh: so even though the transcript is not inside, you still return it?
td: using the get parent-if-get-children rule
gh: rule must apply to all of them, so you don't get transcript since
it doesn't meet the inside condition.

aday: multiple locations makes sense - just aligned mult times.
human alu feature 100,000s, do you want to create a single feature, or
just a single identifier and put it in many different locations.
ee: that is for alignments not parent-child relationship
aday: you consider location as a attribute of the object..
ee: I agree. alu is only one object, but the exon-transcript are
different
ad: would someone want to annotate the separate exons differently?
aday: you would split it off
ad: eg blast alignment, hsp is part of the conceptual alignment.
gh: in bioperl, some people will go one path, some go the other path,
so we need to figure out how to deal with it.

feat filters is clear for parent child relationship.
aday: inside and overlaps
gh: if your overlap query only grazes one child, you return the
parent. this is the only one I'm certain about.
gh: we haven't specified that the child is within bounds of parent.
with insides, we have a difference of opinion.

one exon is within, do you return it?
ad: most clients  will be doing overlaps, you are the only one doing
insides what do you want?
gh: the multiple locations muddies the issue.
if parent child rule is you only return it if parent is inside (and
recursive parent), I've already optimized for that.
For multiple locations, I can catch that and handle it.
the way I want, the behaviour of mult location will be diff than
parent child.
td: for me, the overlaps is the most important thing. Andreas just get
everything.
ad: can we delegate to gregg here for what to do in case of inside.

[A] gregg will write up description for inside query and multiple locations

Status reports
-----------------

gh: updating server. overlaps, insides, types, and each
good news: latest genome assembly on human on affy server overlayed
with allen's server. using hardcoded knowledge in igb for assembly id,
not coordinates yet.
with andrew: making sure clients can understand any variants of
namespace usage in the xml.
get client to use more capabilities like links

ad: example data set together, updated schema to latest spec, but
forgot cigar thing. update validator to use most recent version or rnc
schemas.
gh: even if your server isn't public you can cut and paste into you
validator at http://cgi.biodas.org:8080

aday: biopackages up to date with version 200 of spec file. issues for
nomi, and gregg. off by one error.

bo: small code refactor in the das server. testing that today.

ee: nothing das related yet, but will. implementing style sheets to get
colors for features.

ap: registry ui for upload of a das/2 source. coding for that

gh: what about registry rejecting segment ids if they don't match
standard ids for that coord system. sound good to you?
ap: basically yes. 
td: not done a great deal

gh: Nomi has been here working on apollo client. we'll hear from her
tomorrow. 

-----------------------
post teleconf discussion re: using global identifiers for uri

[Notetaker: just a few morsels were captured here.]

ad: most folks i work with get something going locally, then after
it's going, hook it up with the rest of the world, integrate with
other people. they don't want to revamp their work in order to do
that. 

gh: slightly in favor with andrew

ad: get what we have now. they are still uri's so it's just an
interpretation. will change attributes to be 'uri and 'reference_uri'

gh: how does it get length of segments?


ad: good idea to have coordinates and segments in the document.
add your own track to ensembl, you don't need to give it a segments,
just specify coordinates.
gh: seems like it will encourage servers that can only work with
particular clients.

ad: what about getting rid of coordinates, just needed by Andreas for
registry. 


From Steve_Chervitz at affymetrix.com  Thu Mar 16 20:38:13 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Thu, 16 Mar 2006 12:38:13 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day four, 16 Mar 2006
Message-ID: <C03F0CB5.1D0F8%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day four, 16 Mar 2006

$Id: das2-teleconf-2006-03-16.txt,v 1.1 2006/03/16 20:45:48 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Gregg Helt
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke (at Affy)
  Sanger: Andreas Prlic
  UC Berkeley: Nomi Harris (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


Status reports
---------------

nh: apollo work, reading the registry, saving
capabilties. modifications to code that was based on prototype das
adaptor. Generally lots of under the hood work to bring it up to spec.

bo: diff functionality between allen's server biopackages.net server
and andrew's samepl xml. Updated templates in allen's das server to
match andrew's sample xml.

ad: worked on validation server, all stuff is in cvs. the
http://cgi.openbio.org:8080 server is built off cvs, just check out
and rebuild. 

gh: worked on affy das2 server and client up to current spec based on
whatever the rnc documents say (schema doc) as for xml. no chance to read
andrew's email on query syntax, will incorporate that today.

sc: got latest version of gregg's das/2 server up at affy. serving
hg17, hg16, dm2. Updated code that the das1 server is using based on
latest genoviz jars. Getting some errors when loading data for new
affy arrays. Investigating.

aday: minor bug fixes for spec v200. exporting assay data as different
views.
ucsc browser can viz expression data out of das server in bed format.
das viewer can view as egr format. working on single chip at a time.

ls: here's a great use case for you: there's a cshl fellow creating dna
spectrographs of oligo frequencies presented as audiographs. can really
tell diffs from coding vs non-coding, CpG triplets, microsatellites
harmonics, big matrices of floating point data tied to genome.
consider this a challenge to das to serve this up.
my postdoc sheldon mckay is serving this up give you heatmap back
given a genomic region. new glyph for spectrographic data

aday: format netCDF is good for this, but clients out there don't
vizualize it.
gh: would like to support netCDF in igb. not sure if this is default
way to represent qualtitative data for das.

[A] allen will send lincoln pointer to netCDF.

aday: netCDF is great for cross-lang, cross platform support.
gh: people are pushing wiggle format to ucsc, so we don't want to
restrict to just netCDF.

aday: my refactor yesterday allows treatment of these as templates.
gh: how do this via region query in das?
ls: feature query, tag says here comes binary data, each column
corresponds to a base (or maybe a scaling factor to indicate # of bp
per column). 
tag says here comes binary qualtitatilve data, scale is 1:1.
gh: better way is to use alternative content format stuff (already in
spec for types)
ls: if you do feat request and don't filter by type, you'll get a mix
of binary and non binary.
aday: not in genome domain, genome/sequence the fetch to assay service
to get quant data. then do intersection to find overlap.
performance goes out window if you make the query too complex.
fine to do just two fetches.

ls: how indicate scale for numerical scale?
aday: good question. units are not encoded now.
ls: spectogrphic data one value per window where window is 100 bp
aday: so two diff units
window size, amplitude value and frequency, and that's in four
channels for the bases. we're representing as 4 matrices.
aday: one matrix per channel.many formats don't support n-dimensional
data. only 2d at most.
ls: in das1 did base64 encoded string in the notes. It worked.
gh: we can't require all clients to know how to interpret it.
This is why we have the alt content functionality...

[A] das should support dense numeric data across regions, format specified
by the existing alternative format mechanism

Topic: Spec Freeze
-------------------

ls: can we talk about feezing spec?
ad: what good will it do?
ls: allow us to code to a fixed spec. you freeze spec, people write
code for a defined period of time, during that time we compare notes,
then make changes, freeze, and repeat.
ad: concerned there hasn't been enough work since the changes in jan/feb.
ls: now that i'm 'on the other side of the fence' of spec writing,
i'd like to see it not change, and have time to make an informed view
of what it's strengths and weaknesses are.
ad: haven't gotten feedback about my questions, until the
codesprints. two months ago, only now being addressed.
ls: these issues don't become pressing until we start
implementing. this is why we do code sprints.
ad: worry because there's been no extensive data modeling for
features.
ls: can do a 1 month freeze
gh: comfortable with 1 mon freeze of schemas as they are in the rnc's
now. issues will come up.
ls: announce on biodas.org - march 18th das/2 is frozen for 1 month.
gh: we'll have to live to ambiguity with how server does certain
things.
ls: hence the time limited 'trial' freeze.
ad: would have like people to write code from last feb so I could get
feedback.
ls: you very much improved the spec. grateful for what you've done. I
wasn't getting feedback when I was writing either.
gh: validation website is great for implementers, rather than having
to read a spec document everyday.
ad: schemas aren't going to change after today (pm). would like to
clear some things up about filter language, today?
ls: most urgent freeze

[A] spec will freeze as of end of today (3/16/06, PST) for one month.

Topic: Feature filters
----------------------

ad: feature filters is most important, and how do we define global
names? schema is a simple change - which is req'd and which is
optional but for impls makes a big diff.
ls: global is req'd and local is optional.
ad: who comes up with global names
ls: first person to do it has naming rights.
people have been able to do it for the ensembl service.
ad: I need documented names
gh: it means you don't know whether two names are the same thing until
this document comes out.

ls: filter language?
ad: gregg needs inside and contains,
- type and exact type: das type or ontology type?
ls: das type
gh: uri attribute of the type
ad: that type or it's subtype makes no sense for das types
ls: it's just an exact match. client can use ontology to get a series
of types
ls: should be an exact match, does not traverse ontology.
client should ask user: do you want all exons or a specific type of
exon? 
ls: client goes through ontology as necesary

[A] drop exacttype, type now has exacttype semantics

Topic: XID, feature ids
------------------------

ad: xid in features. no one used yet. gives a ref to some other
db. all it is is a url/uri. feels like there should be more info
(type?)
ad: primary name field for feature, feels like should be name
ls: name is human readable. title would be ok
ad: but feature filter is called name searches name and id fields
ls: this is correct behavior, you can do a fetch on the url/uri
this is ok.
ad: the name feature searches title and alias.

gh: if feature id is resolvable and you resolve it, there's no
guarantee it gives back a das2xml document.
if the feature uri is resolvable, and you fetch it, you will get back
a das2xml document right?
can you put uri in the feature query?
aday: feels that having auto-generated names
ad: do all features have a human readable name?
gh/ls: optional
ad: why would you want to put a url in a name field?
gh: rdf
ad: should be a resolvable resource, das2xml for that feature.

ad: features with aliases, do aliases need type pk or accession?
prosite has false match to ...
ls: this is a property or xid, not alias
ad: suggests that xid needs extra stuff to it.
gh: file with an optional type attribute on xid
ad: let's wait to someone has a need.

Topic: Feature filters (continued)
----------------------------------

gh: feature filters, inside, contains, identical. Which do we need,
which can we drop?

[A] overlaps - keep (all agree)

inside - gregg needs
contains - dropping, maybe
identical - dropping

ad: what about excludes - the complement of overlap?
gh: haven't had time to investigate whether I can use excludes rather
than the inside + overlaps (contains?) combination I need now.

ls: use case: pointing to children and they haven't arrived yet.
gh: my client keeps stuff around, when you get parent/child if you
have parent + all children you can construct feature.
ls: the spec requires single parent, right?
gh: no you can have multiple.
ls: gff3 spec also allows mult parent and children

[A] Lincoln will provide use cases/examples of these features scenarios:
- three or greater hierarchy features
- multiple parents
- alignments


Topic: Registry 
----------------

ap: still here.
gh: looking at registry, having trouble retrieving in a normal
browser. when looking at it in client, I only see biopackages server
registered as a server. Lincoln said there was more?
ap: this is related to mime types, changed from text plain to
x-das-sources 
gh: I get an error: source file could not be red.
lincoln said you added other test das2 servers to it.
ap: working on interface so users can upload servers.
half way through it now. upload a link to sources.
will send email once it's there.

[A] Steve will add gregg's new affy das/2 server to registry when Andreas'
web interface is ready

gh: same time tomorrow.


From cjm at fruitfly.org  Thu Mar 16 20:50:37 2006
From: cjm at fruitfly.org (chris mungall)
Date: Thu, 16 Mar 2006 12:50:37 -0800
Subject: [DAS2] query language description
In-Reply-To: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID: <e3ad320577f173bc0234ca4df6d16645@fruitfly.org>

Hi Andrew

I presume one constraint is that you want to preserve standard CGI URL 
syntax? I think this is the best that can be done using that 
constraint, which is to say, fairly limited. This lacks one of the most 
important features of a real query language, composability. These 
ad-hoc constraint syntaxes have their uses but you'll eventually want 
to go beyond the limits and end up adding awkward extensions. Why not 
just forego the URL constraint and go with a composable extendable 
query language in the first place and save a lot of bother downstream?

On Mar 15, 2006, at 9:17 PM, Andrew Dalke wrote:

> The query fields are
>
>    name      |  takes | matches features ...
>   ==========================
>    xid       |  URI   | which have the given xid
>    type      |  URI   | with the given type or subtype (XX keep this
> one???)
>    exacttype |  URI   | with exactly the given type
>    segment   |  URI   | on the given segment
>    overlaps  | region | which overlap the given region
>    inside    | region | which are contained inside the given region (XX
> needed??)
>    contains  | region | which contain the given region  (XX needed?? )
>    name      | string | with a name or alias which matches the given
> string
>    prop-*    | string | with the property "*" matching the given string
>
> Queries are form-urlencoded requests.  For example, if the features
> query URL is 'http://biodas.org/features' and there is a segment named
> 'http://ncbi.org/human/Chr1' then the following is a request for all 
> the
> features on the first 10,000 bases of that segment
>
> The query is for
>      segment = 'http://ncbi.org/human/Chr1'
>      overlaps = 0:10000
>
> which is form-urlencoded as
>
>
> http://biodas.org/features?
> segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A1000
>
> Multiple search terms with the same key are OR'ed together.  The
> following
> searches for features containing the name or alias of either
> BC048328 or BC015400
>
>    http://biodas.org/features?name=BC048328;name=BC015400
>
> Multiple search terms with different keys are AND'ed together,
> but only after doing the OR search for each set of search terms with
> identical keys.  The following searches for features which have
> a name or alias of BC048328 or BC015400 and which are on the segment
> http://ncbi.org/human/Chr1
>
>
> http://biodas.org/features?name=BC048328;
> segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400
>
> The order of the search terms in the query string does not affect
> the results.
>
> If any part of a complex feature (that is, one with parents
> or parts) matches a search term then all of the parents and
> parts are returned.  (XXX Gregg -- is this correct? XXX)
>
>
> The fields which take URLs require exact matches.
>
> I think we decided that there is no type inferencing done in
> the server; it's a client side thing.  In that case the 'type'
> field goes away.  We can still keep 'exacttype'.  The URI
> used for the matching is the type uri, and NOT the ontology URI.
>
> (We don't have an ontology URI yet, and when we do we can add
> an 'ontology' query.)
>
> The segment URI must accept the local identifier.  For
> interoperability with other servers they must also accept the
> equivalent global identifier, if there is one.
>
> If range searches are given then one and only one segment is
> allowed.  Multiple segments may be given, but then ranges are not
> allowed.
>
> The string searches support a simple search language.
>      ABC  -- contains a word which exactly matches "ABC" (identity, not
> substring)
>     *ABC  -- words ending in "ABC"
>      ABC* -- words starting with "ABC"
>     *ABC* -- words containing the substring "ABC"
>
> If you want a field which exactly contains a '*' you're kinda
> out of luck.  The interpretation of whitespace in the query or
> in the search string is implementation dependent.  For that
> matter, the meaning of "word" is implementation dependent.  (Is
> *O'Malley* one word? *Lethbridge-Stewart*?)
>
> When we looked into this last month at Sanger we verified that
> all the databases could handle %substring% searches, which was
> all that people there wanted.  The Affy people want searches for
> exact word, prefix and suffix matches, as supported by the the
> back-end databases.
>
>
>    XXX CORRECT ME XXX
>
> The 'name' search searches.... It used to search the 'name'
> attribute and the 'alias' fields.  There is no 'name' now.  I
> moved it to 'title'.  I think I did the wrong thing; it should
> be 'name', but it's a name meant for people, not computers.
>
> Some features (sub-parts) don't have human-readable names so
> this field must be optional.
>
>
> The "prop-*" is a search of the <PROP> elements.  Features may
> have properties, like
>
>     <PROP key="cellular_component" value="membrane" />
>
> To do a string search for all 'membrane' cellular components,
> construct the query key by taking  the string "prop-" and
> appending the property key text ("cellular_component").  The
> query value is the text to search for.
>
>      prop-cellular_component=membrane
>
> To search for any cellular_component containing the substring "mem"
>
>      prop-cellular_component=*membrane*
>
> The rules for multiple searches with the same key also apply to the
> prop-* searches.  To search for all 'membrane' or 'nuclear'
> cellular components, use two 'prop-cellular_component' terms, as
>
>
> http://biodas.org/features?prop-cellular_component=membrane;prop-
> cellular_component=membrane
>
>
> The range searches are defined with explicit start and end
> coordinates.  The range syntax is in the form "start:end", for
> example, "1:9".
>
> Let 'min' be the smallest coordinate for a feature on a given
> segment and 'max' be one larger than the largest coordinate.
> These are the lower and upper founds for the feature.
>
> An 'overlaps' search matches if and only if
>     min < end AND max > start
>
> XXX For GREG XXX
>
> What do 'inside' and 'contains' do?  Can't we just get
> away with 'excludes', which has complement of 'overlaps'?
> Searches are done as:
>    Step 0) specify the segment
>    Step 1) do all the includes  (if none, match all features on 
> segment)
>    Step 2) do all the excludes, inverted (like an includes search)
>    Step 3) only return features which are in Step 1 but not
>        in Step 2)
>    Step 4) ...
>    Step 5) Profit!
>
> I think this will support your smart code, and it's easy
> enough to implement.
>
> Every one but you was planning to use 'overlaps'.  Only you
> wanted to use 'inside'.  Anyone want to use 'contains'?
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From dalke at dalkescientific.com  Thu Mar 16 23:24:25 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:24:25 -0800
Subject: [DAS2] 'source' attribute in the types document
Message-ID: <c9f78d43e5f0267a571095bd0b735418@dalkescientific.com>

Types have a 'source' field.

The first draft shows examples like
   source='curated'
   source='genescan'
   source='tRNAscan-SE-1.11'

My interpretation is that this is a human readable field,
with no machine interpretation other than as a string.  It
does not come from a controlled vocabulary.  It may contain
spaces.

This field is not currently searchable because we expect the
number of types to be small enough a client will download
everything and do the search locally.

Let me know if I'm wrong.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 22:46:14 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 14:46:14 -0800
Subject: [DAS2] query language description
In-Reply-To: <e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
Message-ID: <e3e550c9fe3d623420739a36073d0891@dalkescientific.com>

Hi Chris,

> I presume one constraint is that you want to preserve standard CGI URL 
> syntax?

Yes.

>  I think this is the best that can be done using that constraint, 
> which is to say, fairly limited.

Then again, the functionality we need is also fairly limited.

>  This lacks one of the most important features of a real query 
> language, composability. These ad-hoc constraint syntaxes have their 
> uses but you'll eventually want to go beyond the limits and end up 
> adding awkward extensions. Why not just forego the URL constraint and 
> go with a composable extendable query language in the first place and 
> save a lot of bother downstream?

Because no one can decide on a generic language which is more
powerful than this.

Anything more powerful would need to support .. boolean algebra?
numeric searches?  regexps?  What about quoting rules for "multiple
word phrases"?

Is it SQL-like?  XPath/XQuery-like?  Is it a context-free grammar?
How easy is it to implement and work cross-platform?

For what people need now, this search solution seems good.

For the future we can have

   <CAPABILITY type="xpath-query" query_uri="http://whatever" />

and clients which understand that interface will know that it's
there.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 23:38:07 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:38:07 -0800
Subject: [DAS2] new search terms
Message-ID: <5a29cf88a8fc1e8e8448c6e1dd248dbb@dalkescientific.com>

"note=" is a string search of the note fields

   Example: note=And*
     finds all features where which have a note containing
     a word starting with 'And'

"coordinates=" filters for features on that coordinate system.
  (We talked about this one yesterday.)

I'm republish the search terms before the end of the day.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Thu Mar 16 23:54:12 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:54:12 -0800
Subject: [DAS2] comments in schema
Message-ID: <e3731b48f7b05d690ec4fe1e873c482d@dalkescientific.com>

I've updated the schema docs (das/das2/draft3/*.rnc )
to include more detailed comments.

Also, updated the ucla examples to change 'synonym' to
'reference'.

Everything should be up to date.


					Andrew
					dalke at dalkescientific.com


From cjm at fruitfly.org  Fri Mar 17 00:04:03 2006
From: cjm at fruitfly.org (chris mungall)
Date: Thu, 16 Mar 2006 16:04:03 -0800
Subject: [DAS2] query language description
In-Reply-To: <e3e550c9fe3d623420739a36073d0891@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
	<e3e550c9fe3d623420739a36073d0891@dalkescientific.com>
Message-ID: <8b7582943da22dfed23ba7b5386402fb@fruitfly.org>


On Mar 16, 2006, at 2:46 PM, Andrew Dalke wrote:

> Hi Chris,
>
>> I presume one constraint is that you want to preserve standard CGI URL
>> syntax?
>
> Yes.

I'm guessing you've been through this debate before, so no comment..

>
>>  I think this is the best that can be done using that constraint,
>> which is to say, fairly limited.
>
> Then again, the functionality we need is also fairly limited.

ignorant question.. (I have only been tangentially aware of the outer 
edges of the whole das2 process)..

how are you determining the functionality required? surely someone 
somewhere will want to write a das2 client that implements boolean 
queries

I speak from experience - I designed the GO Database API to have a very 
similar constraint language (it's expressed using perl hash keys rather 
than CGI parameters but the same basic idea). For years people have 
been clamouring for the ability to do more complex queries - right now 
they are forced bypass the constraint language and go direct to SQL.

>
>>  This lacks one of the most important features of a real query
>> language, composability. These ad-hoc constraint syntaxes have their
>> uses but you'll eventually want to go beyond the limits and end up
>> adding awkward extensions. Why not just forego the URL constraint and
>> go with a composable extendable query language in the first place and
>> save a lot of bother downstream?
>
> Because no one can decide on a generic language which is more
> powerful than this.
>
> Anything more powerful would need to support .. boolean algebra?
> numeric searches?  regexps?  What about quoting rules for "multiple
> word phrases"?
>
> Is it SQL-like?  XPath/XQuery-like?  Is it a context-free grammar?
> How easy is it to implement and work cross-platform?

None of these really lit into the DAS paradigm. I'm guessing you want 
something simple that can be used as easily as an API with get-by-X 
methods but will seamlessly blend into something more powerful. I think 
what you have is on the right lines. I'm just arguing to make this 
language composable from the outset, so that it can be extended to 
whatever expressivity is required in the future, without bolting on a 
new query system that's incompatible with the existing one.

The generic language could just be some kind of simple extensible 
function syntax for search terms, boolean operators, and some kind of 
(optional) nesting syntax.

If you have boolean operators and it's composable, then yep it does 
have to be as expressive as boolean algebra.

I'd argue that implementing a composable query language is easier than 
an ad-hoc one

> For what people need now, this search solution seems good.
>
> For the future we can have
>
>    <CAPABILITY type="xpath-query" query_uri="http://whatever" />
>
> and clients which understand that interface will know that it's
> there.

hmm, not sure how useful this would be - surely you'd want something 
more dasmodel-aware?

if you're going to just pass-through to xpath or sql then why have a 
das protocol at all?

>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From Gregg_Helt at affymetrix.com  Fri Mar 17 00:22:54 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Thu, 16 Mar 2006 16:22:54 -0800
Subject: [DAS2] query language description
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA3B@msex02.affymetrix.com>

For the type query filter, I'd suggest keeping the exacttype semantics
you discuss below, but using "type" for the field name rather than
"exacttype".  If we're getting rid of one of them, and a non-exact type
is a meaningless concept, it seems like keeping that "exact" part is
unnecessary and potentially confusing.

	gregg
> 
> I think we decided that there is no type inferencing done in
> the server; it's a client side thing.  In that case the 'type'
> field goes away.  We can still keep 'exacttype'.  The URI
> used for the matching is the type uri, and NOT the ontology URI.
> 
> (We don't have an ontology URI yet, and when we do we can add
> an 'ontology' query.)
> 
> The segment URI must accept the local identifier.  For
> interoperability with other servers they must also accept the
> equivalent global identifier, if there is one.
> 
> If range searches are given then one and only one segment is
> allowed.  Multiple segments may be given, but then ranges are not
> allowed.
> 
> The string searches support a simple search language.
>      ABC  -- contains a word which exactly matches "ABC" (identity,
not
> substring)
>     *ABC  -- words ending in "ABC"
>      ABC* -- words starting with "ABC"
>     *ABC* -- words containing the substring "ABC"
> 
> If you want a field which exactly contains a '*' you're kinda
> out of luck.  The interpretation of whitespace in the query or
> in the search string is implementation dependent.  For that
> matter, the meaning of "word" is implementation dependent.  (Is
> *O'Malley* one word? *Lethbridge-Stewart*?)
> 
> When we looked into this last month at Sanger we verified that
> all the databases could handle %substring% searches, which was
> all that people there wanted.  The Affy people want searches for
> exact word, prefix and suffix matches, as supported by the the
> back-end databases.
> 
> 
>    XXX CORRECT ME XXX
> 
> The 'name' search searches.... It used to search the 'name'
> attribute and the 'alias' fields.  There is no 'name' now.  I
> moved it to 'title'.  I think I did the wrong thing; it should
> be 'name', but it's a name meant for people, not computers.
> 
> Some features (sub-parts) don't have human-readable names so
> this field must be optional.
> 
> 
> The "prop-*" is a search of the <PROP> elements.  Features may
> have properties, like
> 
>     <PROP key="cellular_component" value="membrane" />
> 
> To do a string search for all 'membrane' cellular components,
> construct the query key by taking  the string "prop-" and
> appending the property key text ("cellular_component").  The
> query value is the text to search for.
> 
>      prop-cellular_component=membrane
> 
> To search for any cellular_component containing the substring "mem"
> 
>      prop-cellular_component=*membrane*
> 
> The rules for multiple searches with the same key also apply to the
> prop-* searches.  To search for all 'membrane' or 'nuclear'
> cellular components, use two 'prop-cellular_component' terms, as
> 
> 
> http://biodas.org/features?prop-cellular_component=membrane;prop-
> cellular_component=membrane
> 
> 
> The range searches are defined with explicit start and end
> coordinates.  The range syntax is in the form "start:end", for
> example, "1:9".
> 
> Let 'min' be the smallest coordinate for a feature on a given
> segment and 'max' be one larger than the largest coordinate.
> These are the lower and upper founds for the feature.
> 
> An 'overlaps' search matches if and only if
>     min < end AND max > start
> 
> XXX For GREG XXX
> 
> What do 'inside' and 'contains' do?  Can't we just get
> away with 'excludes', which has complement of 'overlaps'?
> Searches are done as:
>    Step 0) specify the segment
>    Step 1) do all the includes  (if none, match all features on
segment)
>    Step 2) do all the excludes, inverted (like an includes search)
>    Step 3) only return features which are in Step 1 but not
>        in Step 2)
>    Step 4) ...
>    Step 5) Profit!
> 
> I think this will support your smart code, and it's easy
> enough to implement.
> 
> Every one but you was planning to use 'overlaps'.  Only you
> wanted to use 'inside'.  Anyone want to use 'contains'?
> 
> 					Andrew
> 					dalke at dalkescientific.com
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From dalke at dalkescientific.com  Fri Mar 17 02:05:06 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 18:05:06 -0800
Subject: [DAS2] query language description
In-Reply-To: <8b7582943da22dfed23ba7b5386402fb@fruitfly.org>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
	<e3e550c9fe3d623420739a36073d0891@dalkescientific.com>
	<8b7582943da22dfed23ba7b5386402fb@fruitfly.org>
Message-ID: <c4433b247b29525254354103b60ce414@dalkescientific.com>

Chris:
> ignorant question.. (I have only been tangentially aware of the outer 
> edges of the whole das2 process)..
>
> how are you determining the functionality required? surely someone 
> somewhere will want to write a das2 client that implements boolean 
> queries

It was informal, based on feedback from client developers and 
maintainers.
Lincoln, Thomas, Andreas, Gregg and others provided that feedback.
It was not by talking with users.

I know there's a wide range of users and use cases.  The point
of this query language is to have basic functionality that all
servers can implement.

> right now they are forced bypass the constraint language and go direct 
> to SQL.

In addition, we provide defined ways for a server to indicate
that there are additional ways to query the server.

> None of these really lit into the DAS paradigm. I'm guessing you want 
> something simple that can be used as easily as an API with get-by-X 
> methods but will seamlessly blend into something more powerful. I 
> think what you have is on the right lines. I'm just arguing to make 
> this language composable from the outset, so that it can be extended 
> to whatever expressivity is required in the future, without bolting on 
> a new query system that's incompatible with the existing one.

We have two ways to compose the system.  If the simple query language
is extended, for example, to support word searches of the text field
instead of substring searches, then a server can say

<CAPABILITY type="features" 
query_uri="http://somewhere.over.rainbow/server.cgi">
   <SUPPORTS name="word-search"/>
</CAPABILITY>

This is backwards compatible, so the normal DAS queries work.  But
a client can recognize the new feature and support whatever new filters
that 'word-search' indicates, eg

   http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*

(finds features with notes containing words starting with 'Andre' )

These are composable.  For example, suppose Sanger allows modification
date searches of curation events.  Then it might say

<CAPABILITY type="features" 
query_uri="http://somewhere.over.rainbow/server.cgi">
   <SUPPORTS name="word-search"/>
   <SUPPORTS name="sanger-curation"/>
</CAPABILITY>

and I can search for notes containing words starting with "Andre"
which were modified by "dalke" between 2002 and 2005 by doing

   http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*&
        modified-by=dalke&modified-before=2005&modified-after=2002


An advantage to the simple boolean logic of the current system
is that the GUI interface is easy, and in line with existing
simple search systems.


If someone wants to implement a new search system which is
not backwards compatible then the server can indicate that
alternative with a new CAPABILITY.  Suppose Thomas at Sanger
comes up with a new search mechanism based on an object query
language he invented,

<CAPABILITY type="down-oql"
     query_uri="http://sanger.ac.uk/oql-search" />

The Sanger and EBI clients might understand that and support
a more complex GUI, eg, with a text box interface.  Everyone
else must ignore unknown capability types.

Then that would be POSTED (or whatever the protocol defines)
to the given URL, which returns back whatever results are
desired.

Or the server can point to a public MySQL port, like

<CAPABILITY type="mysql-connection"
     query_uri="mysql://username:password at hostname:port/databasename" />

That's what you are doing to bypass the syntax, except that
here it isn't a bypass; you can define the new interface in
the DAS sources document.

> The generic language could just be some kind of simple
> extensible function syntax for search terms, boolean operators,
> and some kind of (optional) nesting syntax.

Which syntax?  Is it supposed to be easy for people to write?
Text oriented?  Or tree structured, like XML, or SQL-like?
And which clients and servers will implement that search
language?

If there was a generic language it would allow
   OR("on segment Chr1 between 1000 and 2000",
      "on segment ChrX between 99 and 777")
which is something we are expressly not allowing in DAS2
queries.  It doesn't make sense for the target applications
and by excluding it it simplifies the server development,
which means less chance for bugs.

Also, I personally haven't figured out a decent way to
do a GUI composition of a complex boolean query which is
as easy as learning the query language in the first place.

A more generic language implementation is a lot of overhead
if most (80%? 90%) need basic searches, and many of the
rest can fake it by breaking a request into parts and
doing the boolean logic on the client side.

Feedback I've heard so far is that DAS1 queries were
acceptable, with only a few new search fields needed.

> hmm, not sure how useful this would be - surely you'd want something
> more dasmodel-aware?

The example I gave was a bad one.  What I meant was to show
how there's an extension point so someone can develop a new
search interface and clients can know that the new functionality
exists, without having to change the DAS spec.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Fri Mar 17 04:47:58 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 20:47:58 -0800
Subject: [DAS2] query language description
In-Reply-To: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID: <a27d1b1df9baf83885ce73513ed70b96@dalkescientific.com>

Updated:
   - added 'note' as a query field
   - changed string searches to substring (not word) searches
        and made them be case insensitive

       "AB" matches only the strings "AB", "Ab", "aB" and "ab"
       "*AB" matches only fields which exactly end with
               "AB", "ab", "aB", and "Ab"
       "AB*" matches only fields which exactly match, up to case
       "*AB*" matches only fields which contain the substring,
             up to case

   - added 'coordinates' search

   - removed 'type' and renamed 'exacttype' to 'type'

   - removed 'contains' search, which no one said they wanted.  Instead,
      supporting (EXPERIMENTAL) an 'excludes' search.


==================================

The query fields are

   name      |  takes | matches features ...
  ==========================
   xid       |  URI   | which have the given xid
   type      |  URI   | with exactly the given type
   segment   |  URI   | on the given segment
coordinates |  URI   | which are part of the given coordinate system
   overlaps  | region | which overlap the given region
   excludes  | region | which have no overlap to the given region
   inside    | region | which are contained inside the given region
   name      | string | with a title or alias which matches the given  
string
   note      | string | with a note which matches the given string
   prop-*    | string | with the property "*" matching the given string

Queries are form-urlencoded requests.  For example, if the features
query URL is 'http://biodas.org/features' and there is a segment named
'http://ncbi.org/human/Chr1' then the following is a request for all the
features on the first 10,000 bases of that segment

The query is for
     segment = 'http://ncbi.org/human/Chr1'
     overlaps = 0:10000

which is form-urlencoded as

    
http://biodas.org/features? 
segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A1000

Multiple search terms with the same key are OR'ed together.  The  
following
searches for features containing the name or alias of either
BC048328 or BC015400

   http://biodas.org/features?name=BC048328;name=BC015400

The 'excludes' search is an exception.  See below.

Multiple search terms with different keys are AND'ed together,
but only after doing the OR search for each set of search terms with
identical keys.  The following searches for features which have
a name or alias of BC048328 or BC015400 and which are on the segment
http://ncbi.org/human/Chr1

    
http://biodas.org/features?name=BC048328; 
segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400

The order of the search terms in the query string does not affect
the results.

If any part of a complex feature (that is, one with parents
or parts) matches a search term then all of the parents and
parts are returned.  (XXX Gregg -- is this correct? XXX)


The fields which take URLs require exact matches, that is, a
character by character match.  (For details on the nuances of
comparing URIs see http://www.textuality.com/tag/uri-comp-3.html )

(We don't have an ontology URI yet, and when we do we can add
an 'ontology' query.)

The segment query filter takes a URI.  This must accept
the segment URI and, if known to the server, the equivalent
reference identifier for the segment.

If range searches are given then one and only one segment
must be given.  If there are multiple segment queries then
ranges are not allowed.

The string searches may be exact matches, substring, prefix
or suffix searches.  The query type depends on if the search
value starts and/or ends with a '*'.

     ABC  -- field exactly matches "ABC"
    *ABC  -- field ends with "ABC"
     ABC* -- field starts with "ABC"
    *ABC* -- field contains the substring "ABC"

The "*" has no special meaning except at the start or end
of the query value.  The search term "***" will match a
field which contains the character "*" anywhere.  There
is no way to match fields which exactly match '*' or
which only start or end with that character.

Text searches are case-insensitive.  The string "ABC"
matches "abc", "aBc", "ABC", etc.

A server may choose to collapse multiple whitespace
characters into a single space character for search purposes.
For example, the query "*a newline*" should match

   "This is a line of text which contains a
    newline"


The 'name' search does a text search of the 'title' and 'alias'
fields.


The "prop-*" is shorthand for a class of text searches of
<PROP> elements.  Features may have properties, like

    <PROP key="cellular_component" value="membrane" />

To do a string search for all 'membrane' cellular components,
construct the query key by taking  the string "prop-" and
appending the property key text ("cellular_component").  The
query value is the text to search for, in this case:

     prop-cellular_component=membrane

To search for any cellular_component containing the substring
"membrane"

     prop-cellular_component=*membrane*

The rules for multiple searches with the same key also apply to the
prop-* searches.  To search for all 'membrane' or 'nuclear'
cellular components, use two 'prop-cellular_component' terms, as

      
http://biodas.org/features?prop-cellular_component=membrane;prop- 
cellular_component=membrane


The range searches are defined with explicit start and end
coordinates.  The range syntax is in the form "start:end", for
example, "1:9".  There is no way to restrict the search to
a specific strand.

A feature may have several locations.  An annotation may
have several features in a parent/part relationship.  The
relationship may have several levels.  If a range search
matches any feature in the annotation then the search
returns all of the features in the annotation.

An 'overlaps' search matches if and only if any feature
location of any of the parent or part overlaps the query
range and segment.

An 'inside' search matches if and only if at least one
feature in the annotation has a location on the query segment
and all features which have a location on the query segment
have at least one location which starts and ends in the
query range.

EXPERIMENTAL: An 'excludes' matches if and only if at
least one feature of the annotation is on the query segment
and no features are in the query range.  This is the
complement of the 'overlaps' search, for annotations on
the same query segment.

Unlike the other search keys, if there multiple 'excludes'
searches then the results are AND'ed together.  That is,
if the query is has two excludes ranges
    segment=ChrX excludes=RANGE1 excludes=RANGE2
then the result are those features which on ChrX which
are not in RANGE1 and are not in RANGE2.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Fri Mar 17 07:05:54 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 23:05:54 -0800
Subject: [DAS2] alternate formats
Message-ID: <3f895441c38b74460da9f8e4582b7a74@dalkescientific.com>

If you've read the updated schema definitions you saw
I've added the following comment in the CAPABILITY

   # Format names which can be passed to the query_uri.
   # The names are type dependent.  At present the
   # only reserved names are for the 'features' capability.
   # These are: das2xml, count, uris
   format*,


We talked about this in the UK I think, and I mentioned
it to people here.  The 'count' format returns the count
of features which would be returned for a given query.

This is a single line containing the integer followed by
a newline.  The content-type of the document is text/plain .

For example, to get the number of all the features on
the server

Request:

http://www.example.com/das2/mus/v22/features?format=count

Response:

Content-Type: text/plain

129254


I will add this format description to the spec.


When does the server need to declare that it implements
a given document type?  My thought is that if the format
list is not specified then the server must implement
'das2xml' and 'count' formats.  If it doesn't implement
the 'count' format then it needs to declare the complete
list of what it does support.


In addition I'll describe here the 'uris' format.  It is
a document of content-type text/plain containing the
matching feature URIs, one per line.  For example,

file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.0.A1_3p_a_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.0.A1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.1.S1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.2.S1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: 
Hs.21346.3.S1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P:Hs.271468.0.S1_3p_at


(I feel like it should implement an xml:base scheme to reduce
the amount of traffic.)

The idea is that a client can request the URIs only, eg,
to do more complex boolean-esque searches by doing simpler
ones on the server and combining the results in client space.
For another example, if the client already knows the feature
data for a URI then it doesn't need to download the data again.
So it gets a list of URIs then only fetches the ones it
does not know about.

This requires HTTP/1.1 pipelining for good performance.

Because there are no clients which want it, because I'm not
certain on the format, and because of the lack of pipelining
in the existing servers, I will not document this format.  I'll
just leave it as a reserved format name.

					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Fri Mar 17 07:33:44 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 23:33:44 -0800
Subject: [DAS2] debugging validation proxy
Message-ID: <c5545a0079e981c657e86d60057859a4@dalkescientific.com>

After a conversation with Gregg this afternoon I this evening
implemented a debugging validation proxy for DAS.  The code
is about 100 lines long and combines Python's "twisted" network
library and the dasypus validator.

To make it work, configure your DAS client to use a proxy,
which is this validation proxy.  Then do things like normal.
The request go through the proxy.  It dumps the request
info to stdout and forwards the request to the real server.

It requires the response headers and body.  When finished
it passed the data to dasypus.

I stuck some DAS-ish XML on my company web server
and did the connection like this

% curl -x localhost:8080 http://www.dalkescientific.com/sources.xml

The output from the debug window is

Making request for 'http://www.dalkescientific.com/sources.xml'
Warning: Unknown Content-Type 'application/xml'.
Info: Assuming doctype of 'sources' based on root element at byte 40, 
line 2, column 2
Finished processing


					Andrew
					dalke at dalkescientific.com


From allenday at ucla.edu  Thu Mar 16 18:27:56 2006
From: allenday at ucla.edu (Allen Day)
Date: Thu, 16 Mar 2006 10:27:56 -0800 (PST)
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <200603151046.43196.lstein@cshl.edu>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151046.43196.lstein@cshl.edu>
Message-ID: <Pine.LNX.4.58.0603161027260.30576@sumo.ctrl.ucla.edu>

Hi Lincoln,

Please just code to what is there, and expect your code to break when I 
update the biopackages server to v300 (probably next week).

-Allen

On Wed, 15 Mar 2006, Lincoln Stein wrote:

> Hi Folks,
> 
> I just ran through the source request on biopackages.net and it is returning 
> something that is very different from the current spec (CVS updated as of 
> this morning UK time). I understand why there is a discrepancy, but for the 
> purposes of the code sprint, should I code to what the spec says or to what 
> biopackages.net returns? It is much more fun for me to code to a working 
> server because I have the opportunity to watch my code run.
> 
> Best,
> 
> Lincoln
> 
> 


From Gregg_Helt at affymetrix.com  Fri Mar 17 08:22:12 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Fri, 17 Mar 2006 00:22:12 -0800
Subject: [DAS2] New affymetrix das/2 development server
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA3D@msex02.affymetrix.com>


I checked in a new version of the Affymetrix DAS/2 server this evening
that supports XML responses based on the latest DAS/2 spec, version 300.
For sample sources, segments, types, and features responses it passes
the Dasypus validator tests.  The validator was _very_ useful for
bringing the server up to the current spec!  Steve rolled the new
version out on our public test server, the root sources query URL is
http://205.217.46.81:9091/das2/genome/sequence.  In the latest version
of IGB checked into CVS, this server can be accessed as "Affy-temp" in
the list of DAS/2 servers.

Although the server's XML responses conform to spec v.300, the query
strings it recognizes still only conform to a subset of spec v.200.  I
expect to have the queries upgraded to v.300 tonight.  But it will
probably still only support a subset of the query filters: one type
(required), one overlaps (required), one inside (optional).

This server also supports bed, psl, and some binary formats as
alternative content formats, depending on the type of the annotations.

	gregg

> -----Original Message-----
> From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-
> bio.org] On Behalf Of Steve Chervitz
> Sent: Wednesday, March 15, 2006 1:25 PM
> To: DAS/2
> Subject: [DAS2] New affymetrix das/2 development server
> 
> 
> Gregg's latest spec-compliant, but still development-grade, das/2
server
> is
> now publically available via http://205.217.46.81:9091
> 
> It's currently serving annotations from the following assemblies:
> - human hg16
> - human hg17
> - drosophila dm2
> 
> Send me requests for any other data sources that would help your
> development
> efforts.
> 
> Example query to get back a das-source xml document:
> http://205.217.46.81:9091/das2/genome/sequence
> 
> It's compliance with the spec is steadily improving, on a daily if not
> hourly basis during the code sprint.
> 
> Within IGB you can access this server from the DAS/2 servers tab
> under 'Affy-temp'.
> 
> You'll need the latest version of IGB from the CVS repository at
> http://sf.net/projects/genoviz
> 
> Steve
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From dalke at dalkescientific.com  Fri Mar 17 16:09:44 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 17 Mar 2006 08:09:44 -0800
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <Pine.LNX.4.58.0603161027260.30576@sumo.ctrl.ucla.edu>
References: <C71929195D04BF48BAECD499AF717B480198CA2D@msex02.affymetrix.com>
	<200603151046.43196.lstein@cshl.edu>
	<Pine.LNX.4.58.0603161027260.30576@sumo.ctrl.ucla.edu>
Message-ID: <e5e618da98f01b9a6cbc2b97cc8e34e1@dalkescientific.com>

Allen:
> Please just code to what is there, and expect your code to break when I
> update the biopackages server to v300 (probably next week).

So you all know, "300" is what we've been calling the current
version of the spec, based on the code freeze that started 8
hours ago.  It's the one currently only described in the schema
definitions and in the example files under das/das2/draft3.


					Andrew
					dalke at dalkescientific.com


From dalke at dalkescientific.com  Fri Mar 17 16:40:20 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 17 Mar 2006 08:40:20 -0800
Subject: [DAS2] proxies, caching and network configuration
Message-ID: <58f16cd7fac095a708fd81a5cc5e40df@dalkescientific.com>

I'm writing to encourage DAS client authors to include
support for proxies when fetching DAS URLs.

Nomi pointed out that Apollo supports proxies, because
users asked for it.  I think it's because some sites
don't have direct access to the internet.  I know a few
of my clients have internal networks set up that way.

Yesterday we talked a bit about how to point to local
mirrors.  It would be hard to have a standard configuration
so that all DAS client code can know about local mirrors.
I mentioned setting up proxies, but dismissed the idea.

Now I'm thinking that that might be the solution.  If
there are local ways to get, say, sequence data then
that could be done at the proxy level.  Someone can easily
(with less than 100 lines of code) write a new proxy
server which points to a local resource if it knows that
a URI is resolvable that way.

Having proxy support also helps with debugging, like in
the debugging proxy server I wrote yesterday.

A nice thing is that some people want proxy support
anyway, so if client code supports proxies then these
other things (redirection to local mirrors, debugging)
can be set up later, and with no extra work in the client.

					Andrew
					dalke at dalkescientific.com


From Steve_Chervitz at affymetrix.com  Fri Mar 17 18:47:51 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Fri, 17 Mar 2006 10:47:51 -0800
Subject: [DAS2] New affymetrix das/2 development server
In-Reply-To: <C03DC62B.1D090%Steve_Chervitz@affymetrix.com>
Message-ID: <C0404457.1D185%Steve_Chervitz@affymetrix.com>


The affy das/2 development server at http://205.217.46.81:9091 has been
updated to better support DAS/2 spec version 300.

Gregg says:
> Changed genometry DAS/2 server so that it responds to feature queries that use
> DAS/2 v.300 feature filters.  Currently implements a subset of
> the v.300 feature query spec:
>     requires one and only one segment filter
>     requires one and only one type filter
>     accepts zero or one inside filter
> Also attempts to support DAS/2 v.200 feature filters, but success is not
> guaranteed.

Steve 

> From: Steve Chervitz <Steve_Chervitz at affymetrix.com>
> Date: Wed, 15 Mar 2006 13:24:59 -0800
> To: DAS/2 <das2 at lists.open-bio.org>
> Conversation: New affymetrix das/2 development server
> Subject: New affymetrix das/2 development server
> 
> 
> Gregg's latest spec-compliant, but still development-grade, das/2 server is
> now publically available via http://205.217.46.81:9091
> 
> It's currently serving annotations from the following assemblies:
> - human hg16 
> - human hg17 
> - drosophila dm2
> 
> Send me requests for any other data sources that would help your development
> efforts.
> 
> Example query to get back a das-source xml document:
> http://205.217.46.81:9091/das2/genome/sequence
> 
> It's compliance with the spec is steadily improving, on a daily if not hourly
> basis during the code sprint.
> 
> Within IGB you can access this server from the DAS/2 servers tab
> under 'Affy-temp'.
> 
> You'll need the latest version of IGB from the CVS repository at
> http://sf.net/projects/genoviz
> 
> Steve


From dalke at dalkescientific.com  Fri Mar 17 20:09:42 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 17 Mar 2006 12:09:42 -0800
Subject: [DAS2] defined minimum limits
Message-ID: <dccc4a7c5783153b73b7537073df36e7@dalkescientific.com>

We should define minimum sizes for fields in the server database.
For example, "the server must support feature titles of at least
40 characters", "must handle at least two 'excludes' feature filters".

And define what do to when the server decides that writeback of
a 30MB feature is just a bit too large.

					Andrew
					dalke at dalkescientific.com


From boconnor at ucla.edu  Fri Mar 17 23:23:09 2006
From: boconnor at ucla.edu (Brian O'Connor)
Date: Fri, 17 Mar 2006 15:23:09 -0800
Subject: [DAS2] das.biopackages.net Updated to Spec 300
Message-ID: <441B44DD.5010505@ucla.edu>

Hi,

So I checked in my changes to the DAS/2 server which should bring it up 
to the 300 spec.  Allen updated the das.biopackages.net server and I 
tested the following URLs in Andrew's validation app.  They all appear 
to be OK:

* http://das.biopackages.net/das/genome
* http://das.biopackages.net/das/genome/yeast
* http://das.biopackages.net/das/genome/human
* http://das.biopackages.net/das/genome/yeast/S228C
* http://das.biopackages.net/das/genome/human/17
* http://das.biopackages.net/das/genome/yeast/S228C/segment
* http://das.biopackages.net/das/genome/human/17/segment
* http://das.biopackages.net/das/genome/yeast/S228C/type
* http://das.biopackages.net/das/genome/human/17/type
* 
http://das.biopackages.net/das/genome/yeast/S228C/feature?overlaps=chrI/1:1000
* 
http://das.biopackages.net/das/genome/human/17/feature?overlaps=chr1/1000:2000

Let Allen or I know if you run into problems.

--Brian


From cjm at fruitfly.org  Sat Mar 18 00:20:14 2006
From: cjm at fruitfly.org (chris mungall)
Date: Fri, 17 Mar 2006 16:20:14 -0800
Subject: [DAS2] query language description
In-Reply-To: <c4433b247b29525254354103b60ce414@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<e3ad320577f173bc0234ca4df6d16645@fruitfly.org>
	<e3e550c9fe3d623420739a36073d0891@dalkescientific.com>
	<8b7582943da22dfed23ba7b5386402fb@fruitfly.org>
	<c4433b247b29525254354103b60ce414@dalkescientific.com>
Message-ID: <da80d7e4da69801f0a7b2b210fc66595@fruitfly.org>


On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote:

>> right now they are forced bypass the constraint language and go direct
>> to SQL.
>
> In addition, we provide defined ways for a server to indicate
> that there are additional ways to query the server.

I was positing this as a bad feature, not a good one. or at least a 
symptom of an incorrectly designed system (at least in the case of the 
GO DB API - it may not carry forward to DAS - though if you're going to 
allow querying by terms...)

>
>> None of these really lit into the DAS paradigm. I'm guessing you want
>> something simple that can be used as easily as an API with get-by-X
>> methods but will seamlessly blend into something more powerful. I
>> think what you have is on the right lines. I'm just arguing to make
>> this language composable from the outset, so that it can be extended
>> to whatever expressivity is required in the future, without bolting on
>> a new query system that's incompatible with the existing one.
>
> We have two ways to compose the system.  If the simple query language
> is extended, for example, to support word searches of the text field
> instead of substring searches, then a server can say
>
> <CAPABILITY type="features"
> query_uri="http://somewhere.over.rainbow/server.cgi">
>    <SUPPORTS name="word-search"/>
> </CAPABILITY>
>
> This is backwards compatible, so the normal DAS queries work.  But
> a client can recognize the new feature and support whatever new filters
> that 'word-search' indicates, eg
>
>    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*
>
> (finds features with notes containing words starting with 'Andre' )
>
> These are composable.  For example, suppose Sanger allows modification
> date searches of curation events.  Then it might say
>
> <CAPABILITY type="features"
> query_uri="http://somewhere.over.rainbow/server.cgi">
>    <SUPPORTS name="word-search"/>
>    <SUPPORTS name="sanger-curation"/>
> </CAPABILITY>

so this is limited to single-argument search functions?

>
> and I can search for notes containing words starting with "Andre"
> which were modified by "dalke" between 2002 and 2005 by doing
>
>    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*&
>         modified-by=dalke&modified-before=2005&modified-after=2002

but the compositionality is always associative since the CGI parameter 
constraint forbids nesting

> An advantage to the simple boolean logic of the current system
> is that the GUI interface is easy, and in line with existing
> simple search systems.

there's nothing preventing you from implementing a simple GUI on top of 
an expressive system - there is nothing forcing you to use the 
expressivity

> If someone wants to implement a new search system which is
> not backwards compatible then the server can indicate that
> alternative with a new CAPABILITY.  Suppose Thomas at Sanger
> comes up with a new search mechanism based on an object query
> language he invented,
>
> <CAPABILITY type="down-oql"
>      query_uri="http://sanger.ac.uk/oql-search" />
>
> The Sanger and EBI clients might understand that and support
> a more complex GUI, eg, with a text box interface.  Everyone
> else must ignore unknown capability types.

but this doesn't integrate with the existing query system

>
> Then that would be POSTED (or whatever the protocol defines)
> to the given URL, which returns back whatever results are
> desired.
>
> Or the server can point to a public MySQL port, like
>
> <CAPABILITY type="mysql-connection"
>      query_uri="mysql://username:password at hostname:port/databasename" 
> />
>
> That's what you are doing to bypass the syntax, except that
> here it isn't a bypass; you can define the new interface in
> the DAS sources document.
>
>> The generic language could just be some kind of simple
>> extensible function syntax for search terms, boolean operators,
>> and some kind of (optional) nesting syntax.
>
> Which syntax?  Is it supposed to be easy for people to write?
> Text oriented?  Or tree structured, like XML, or SQL-like?

I'd favour some concrete asbtract syntax that looks much like the 
existing DAS QL

> And which clients and servers will implement that search
> language?

all servers. clients optional

>
> If there was a generic language it would allow
>    OR("on segment Chr1 between 1000 and 2000",
>       "on segment ChrX between 99 and 777")
> which is something we are expressly not allowing in DAS2
> queries.  It doesn't make sense for the target applications
> and by excluding it it simplifies the server development,
> which means less chance for bugs.

this example is pointless but it's easy to imagine plenty of ontology 
term queries or other queries in which this would be useful

I guess I depart from the normal DAS philosophy - I don't see this 
being a high barrier for entry for servers, if they're not up to this 
it'll probably be a buggy hacky server anyway

> Also, I personally haven't figured out a decent way to
> do a GUI composition of a complex boolean query which is
> as easy as learning the query language in the first place.

doesn't mean it doesn't exist.

i'm not sure what's hard about having say, a clipboard of favourite 
queries, then allowing some kind of drag-and-drop composition

> A more generic language implementation is a lot of overhead
> if most (80%? 90%) need basic searches, and many of the
> rest can fake it by breaking a request into parts and
> doing the boolean logic on the client side.

this is always an option - if the user doesn't mind the additional 
possibly very high overhead. it's just a little bit of a depressing 
approach, as if Codd's seminal paper from 1970 or whenever it was never 
happened.

> Feedback I've heard so far is that DAS1 queries were
> acceptable, with only a few new search fields needed.
>
>> hmm, not sure how useful this would be - surely you'd want something
>> more dasmodel-aware?
>
> The example I gave was a bad one.  What I meant was to show
> how there's an extension point so someone can develop a new
> search interface and clients can know that the new functionality
> exists, without having to change the DAS spec.

ok

that's probably all I've got to say on the matter, sorry for being 
irksome. I guess I'm fundamentally missing something, that is, why wrap 
simple and expressive declarative query languages with limited ad-hoc 
constraint systems with consciously limited expressivity and limited 
means of extensibility..

cheers
chris

>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From Steve_Chervitz at affymetrix.com  Mon Mar 20 04:54:36 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Sun, 19 Mar 2006 20:54:36 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day five, 17 Mar 2006
Message-ID: <C043758C.1D23A%Steve_Chervitz@affymetrix.com>

Notes from DAS/2 code sprint #2, day five, 17 Mar 2006

$Id: das2-teleconf-2006-03-17.txt,v 1.2 2006/03/20 05:05:22 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Dalke Scientific: Andrew Dalke (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 

Agenda: 

* Status reports
* Writeback progress


Status reports:
---------------

gh: This is the last mtg of code sprint. For the status reports, focus
on where you are at and what you are hoping to accomplish post-sprint.

gh: working on version of affy server that impls das/2 v300 spec for
all xml responses. sample responses passed andrew's validation.
steve rolled it out to public server.

updated igb client to handle v300 xml.
worked more on server to impl v300 query syntax using full uri for
type segment, segment separate from overlaps and inside.
only impls a subset of the feature query. requires one and only one
segment, type, insides.

hoping todo for rest of sprint and after:
1. supporting name feat filters in igb client
2. remove restrictions from the server
3. making sure new version of server gets rolled out,
4. roll out jar for this version of igb. maybe put on genoviz sf site for
testing purposes.

bo: looked at xml docs that andrew checked in, updating ucla templates
on server, not rolled out to biopackages.net, waiting to make rpm,
hoping to do code cleanup in igb.
getting andrew's help running validator on local copy of server.

gh: igb would like to support v300, but one server is v200+ (ucla),
one at v300 (affy) complicates things. so getting your server good to
go would be my priority.

bo: code clean up involves assay and ontology interface.

gh: we're planning an igb release at end of march. as long as the code
is clean by then it's ok.

aday: code cleanup, things removed from protocol. exporting data
matrices from assay part of server.
validate sources document w/r/t v300 validator. work with brian to
make sure everything is update to v300. probably working on fiter
query, since we now treat things as names not full uri's.

ad: what extra config info do you need in server for that? can you get
it from the http headers?
gh: mine is being promiscuous, just name of type will work. might give
the wrong thing back, but for data we're serving back now, it can't be
wrong.

ad: how much trouble does the uri handling cause for you?

gh: has to be full uri of the type, doing otherwise is not an option
(in the spec).
ad: you could just use name internally, then put together full uri
when you go to the outside world.

ad: I updated comments in schema definitions, updated query lang
spec. string searches are substring searches not word-substring
searches. 
abc = whole field must be equal
*abc = suffix match
abc* = prefix match

previously said it was word match, but that's too complicated on
server.
worked with gregg to pin down what inside search means.

I'm thinking about the possibility of a validating proxy server,
configure das client to go through proxy before outside world, the
server would sniff everything going by.
Support for proxys can enable lots of sorts of things w/o needing
additional config for each client.

gh: how do you do proxy in java? i.e., redirect all network calls to a
proxy.
bo: there's a way to set proxy options via the system object in the
java vm. can show you some examples of this.

aday: performance.
gh: current webstart based ibg works with the existing public das/2
server, [comment pertaining to: the new version of igb and a new
version of the affy das/2 server].

ad: when will we get reference names from lincoln?
gh: should happen yesterday. poke him about this.
would be really nice to be able to overlay anotations!

The current version of igb can turn off v300 options, and then ti can
load stuff from the ucla server. The version of igb in cvs now can hit
both biopackages.net and affy server in the dmz. and there's
hardwiring to get things to overlay. temporary patch.

ee: two things:
1. style sheets. info from andrew yesterday. looking over that. will
   discuss questions w/ andrew.
2. making sure that when we do a new release of igb in a couple of
   weeks (when I'm not here) that it will go smoothly . go over w/
   gregg, steve. lots of testing.
made changes in parser code, should still work.

sc: I updated jars for das/1 not das/2 on netaffxdas.affymetrix.com.
ee: it's the das/1 I'm most concerned about.

sc: installed and updated gregg's new das/2 server on a publically
accessible machine (separate box from the production das/1 and das/2
servers on netaffxdas.affymetrix.com).
Also spent a time loading data for new affy arrays (mouse rat
exons). this required lots of memory, had to disable support for some
other arrays. [gregg's das servers load all annotations into memory at
start up, hance the big memory requirements for arrays with lots of
probe sets.]

[A] gregg optimize affy das server memory reqts for exon arrays.

gh: we' gotten a lot done this week. I think we have a stable spec.

gh: serving alignments, no cigars, but blat alignment to genome as
coords on mrna and coords on the genome. igb doesn't use it yet, but
it's there.
ad: xid in region elements.
gh: we haven't exercised the xids. so 'link' in das/1 is equivalent to
xid in das/2?
ad: yes. i believe
gh: if you have links in das/1. without links it can build links from
feature id using a template. This is used for building links from
within IGB back to netaffx, for example.

Topic: Writebacks
-----------------

gh: writebacks haven't been mentioned at all this week.
ad: we need people committed to writing a server to implement it.
gh: we decided that since ed griffith would be working on it at
Sanger, we wouldn't worry about it for ucla server.
bo: we started prototyping. locking mechanism. persisting part of a
mage document. the spec changed after that. andrew's delta model. a
little different from what we were prototyping.
actual persistence will be done in the assay portion of our server.
gh: grant focuses on write back for genome portion, and this was a big
chunk of the grant. ends in end of may or june.

ad: delta model was: here's a list of add, delete, modify in one
document. An issue was if you change an existing record, do you give
it a new identifier?
gh: you never modify something with an existing id, just make a new
one, new id, with a pointer back to old one. Ed Griffith said this a
month ago. I like this idea. but told we cannot make this requirement
on the database. but very few dbs will be writeback, so it's not
affecting all servers

ad: making new uris, client has to know the new uri for the old
one. needs to return a mapping document.
if network crashes partway through, client won't know mapping is and
will be lost.
gh: server doesn't know if client got it. it could act(?) back.

gh: if a response from http server dies, server has no way to know.
ad: There could be a proxy in the middle, or isp's proxy server. The
server sent it successfully to the proxy, but never made it to the
client. 

gh: how is this dealt with for commits into relational dbs? same thing
applies 
ad: don't know
ee: could ask for everything in this region.
ad: have a new element that says 'i used to be this'.
bo: you do an insert in a db, to get last pk that was issued. client
talks back to server, give me last feature uri that was provisioned on
my connection. so the client is in control.

sc: it's up to client to get confirmation from server. If it failed to
get the response after sending in the modification request, it could
request that the server send it again.

ad: (drawing on whiteboard) two stage strategy, get a transaction state.

     post "get transaction url"
    <---------------
    post (put?) to transaction URL
    ------------->
    can do multiple (if identical)
       ---------->
       ---------->
    Get was successful and here's transformation info
    <---------------

ad: server can hold transformation info for some timespan in case
client needs to re-fetch.

gh: I'm more insterested in getting a server up than a client
regarding writeback. complex parts of the client are already
implemented (apollo).

gh: locks are region based not feature based.
ad: can't lock...

gh: we can talk about how to trigger ucla locking mechanism.
bo: did flock transactional locking the suggested in perl
cookbook. mage document has content. server locks an id using flock,
(for assay das).
gh: to lock a region on the genome, lock on all ids for features in
this region.
bo: make a file containing all the ids that are locked. flock this
file. 

ad: file locking is frought with problems. why not keep it in the
database and let the db lock it for you. don't let perl + file system
do it for you. there could be fs problems. nfs isn't good at that. a
database is much more reliable.

bo: I went with perl flock mechanism since you could have other
non-database sources (though so far it's all db).

[A] steve, allen send brian code tips regarding locking.

gh: putting aside pushing large data chunks into the server, for
curation it's ok if protocol is a little error prone, since the
curator-caused errors will be much more likely/common.

ad: UK folks haven't done any writeback work as far as I know.
gh: they haven't billed us in 2 years. Tony cox is contact, ed
griffith is main developer.
ad: andreas and thomas are not funded by this grant or the next one.
gh: they are already funded by other means.

ad: if someone want's to change an annotation should they need to get
a lock first or can it work like cvs? do it if it can, get lock,
release lock in one transaction.
ee: that's my preference.

ad: if every feature has it's own id, you know if it's...

ee: some servers might not have any writeback facility at
all. conflicts will be rare.

[A] ask ed/tony on whether they plan to have any writeback facility

gh: ed g wanted to work on client to do writeback, don't know who
would work on a server there.
ad: someone else, can't remember - roy?
gh: unless we hear back from sanger, the highest priority for ucla
folks after updating server for v300, is working server-side
writeback. 

gh: spec freeze is for the read portion. the writeback portion will
have to change as needed.
ad: and arithmetic? ;-)


From lstein at cshl.edu  Mon Mar 20 17:27:59 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 20 Mar 2006 12:27:59 -0500
Subject: [DAS2] Notes from DAS/2 code sprint #2, day five, 17 Mar 2006
In-Reply-To: <C043758C.1D23A%Steve_Chervitz@affymetrix.com>
References: <C043758C.1D23A%Steve_Chervitz@affymetrix.com>
Message-ID: <200603201227.59816.lstein@cshl.edu>

Hi Folks,

I will join the DAS2 call a little late today (no more than 10 min). I'm 
assuming that we're on?

Lincoln

On Sunday 19 March 2006 23:54, Steve Chervitz wrote:
> Notes from DAS/2 code sprint #2, day five, 17 Mar 2006
>
> $Id: das2-teleconf-2006-03-17.txt,v 1.2 2006/03/20 05:05:22 sac Exp $
>
> Note taker: Steve Chervitz
>
> Attendees:
>   Affy: Steve Chervitz, Ed E., Gregg Helt
>   Dalke Scientific: Andrew Dalke (at Affy)
>   UCLA: Allen Day, Brian O'Connor (at Affy)
>
> Action items are flagged with '[A]'.
>
> These notes are checked into the biodas.org CVS repository at
> das/das2/notes/2006. Instructions on how to access this
> repository are at http://biodas.org
>
> DISCLAIMER:
> The note taker aims for completeness and accuracy, but these goals are
> not always achievable, given the desire to get the notes out with a
> rapid turnaround. So don't consider these notes as complete minutes
> from the meeting, but rather abbreviated, summarized versions of what
> was discussed. There may be errors of commission and omission.
> Participants are welcome to post comments and/or corrections to these
> as they see fit.
>
> Agenda:
>
> * Status reports
> * Writeback progress
>
>
> Status reports:
> ---------------
>
> gh: This is the last mtg of code sprint. For the status reports, focus
> on where you are at and what you are hoping to accomplish post-sprint.
>
> gh: working on version of affy server that impls das/2 v300 spec for
> all xml responses. sample responses passed andrew's validation.
> steve rolled it out to public server.
>
> updated igb client to handle v300 xml.
> worked more on server to impl v300 query syntax using full uri for
> type segment, segment separate from overlaps and inside.
> only impls a subset of the feature query. requires one and only one
> segment, type, insides.
>
> hoping todo for rest of sprint and after:
> 1. supporting name feat filters in igb client
> 2. remove restrictions from the server
> 3. making sure new version of server gets rolled out,
> 4. roll out jar for this version of igb. maybe put on genoviz sf site for
> testing purposes.
>
> bo: looked at xml docs that andrew checked in, updating ucla templates
> on server, not rolled out to biopackages.net, waiting to make rpm,
> hoping to do code cleanup in igb.
> getting andrew's help running validator on local copy of server.
>
> gh: igb would like to support v300, but one server is v200+ (ucla),
> one at v300 (affy) complicates things. so getting your server good to
> go would be my priority.
>
> bo: code clean up involves assay and ontology interface.
>
> gh: we're planning an igb release at end of march. as long as the code
> is clean by then it's ok.
>
> aday: code cleanup, things removed from protocol. exporting data
> matrices from assay part of server.
> validate sources document w/r/t v300 validator. work with brian to
> make sure everything is update to v300. probably working on fiter
> query, since we now treat things as names not full uri's.
>
> ad: what extra config info do you need in server for that? can you get
> it from the http headers?
> gh: mine is being promiscuous, just name of type will work. might give
> the wrong thing back, but for data we're serving back now, it can't be
> wrong.
>
> ad: how much trouble does the uri handling cause for you?
>
> gh: has to be full uri of the type, doing otherwise is not an option
> (in the spec).
> ad: you could just use name internally, then put together full uri
> when you go to the outside world.
>
> ad: I updated comments in schema definitions, updated query lang
> spec. string searches are substring searches not word-substring
> searches.
> abc = whole field must be equal
> *abc = suffix match
> abc* = prefix match
>
> previously said it was word match, but that's too complicated on
> server.
> worked with gregg to pin down what inside search means.
>
> I'm thinking about the possibility of a validating proxy server,
> configure das client to go through proxy before outside world, the
> server would sniff everything going by.
> Support for proxys can enable lots of sorts of things w/o needing
> additional config for each client.
>
> gh: how do you do proxy in java? i.e., redirect all network calls to a
> proxy.
> bo: there's a way to set proxy options via the system object in the
> java vm. can show you some examples of this.
>
> aday: performance.
> gh: current webstart based ibg works with the existing public das/2
> server, [comment pertaining to: the new version of igb and a new
> version of the affy das/2 server].
>
> ad: when will we get reference names from lincoln?
> gh: should happen yesterday. poke him about this.
> would be really nice to be able to overlay anotations!
>
> The current version of igb can turn off v300 options, and then ti can
> load stuff from the ucla server. The version of igb in cvs now can hit
> both biopackages.net and affy server in the dmz. and there's
> hardwiring to get things to overlay. temporary patch.
>
> ee: two things:
> 1. style sheets. info from andrew yesterday. looking over that. will
>    discuss questions w/ andrew.
> 2. making sure that when we do a new release of igb in a couple of
>    weeks (when I'm not here) that it will go smoothly . go over w/
>    gregg, steve. lots of testing.
> made changes in parser code, should still work.
>
> sc: I updated jars for das/1 not das/2 on netaffxdas.affymetrix.com.
> ee: it's the das/1 I'm most concerned about.
>
> sc: installed and updated gregg's new das/2 server on a publically
> accessible machine (separate box from the production das/1 and das/2
> servers on netaffxdas.affymetrix.com).
> Also spent a time loading data for new affy arrays (mouse rat
> exons). this required lots of memory, had to disable support for some
> other arrays. [gregg's das servers load all annotations into memory at
> start up, hance the big memory requirements for arrays with lots of
> probe sets.]
>
> [A] gregg optimize affy das server memory reqts for exon arrays.
>
> gh: we' gotten a lot done this week. I think we have a stable spec.
>
> gh: serving alignments, no cigars, but blat alignment to genome as
> coords on mrna and coords on the genome. igb doesn't use it yet, but
> it's there.
> ad: xid in region elements.
> gh: we haven't exercised the xids. so 'link' in das/1 is equivalent to
> xid in das/2?
> ad: yes. i believe
> gh: if you have links in das/1. without links it can build links from
> feature id using a template. This is used for building links from
> within IGB back to netaffx, for example.
>
> Topic: Writebacks
> -----------------
>
> gh: writebacks haven't been mentioned at all this week.
> ad: we need people committed to writing a server to implement it.
> gh: we decided that since ed griffith would be working on it at
> Sanger, we wouldn't worry about it for ucla server.
> bo: we started prototyping. locking mechanism. persisting part of a
> mage document. the spec changed after that. andrew's delta model. a
> little different from what we were prototyping.
> actual persistence will be done in the assay portion of our server.
> gh: grant focuses on write back for genome portion, and this was a big
> chunk of the grant. ends in end of may or june.
>
> ad: delta model was: here's a list of add, delete, modify in one
> document. An issue was if you change an existing record, do you give
> it a new identifier?
> gh: you never modify something with an existing id, just make a new
> one, new id, with a pointer back to old one. Ed Griffith said this a
> month ago. I like this idea. but told we cannot make this requirement
> on the database. but very few dbs will be writeback, so it's not
> affecting all servers
>
> ad: making new uris, client has to know the new uri for the old
> one. needs to return a mapping document.
> if network crashes partway through, client won't know mapping is and
> will be lost.
> gh: server doesn't know if client got it. it could act(?) back.
>
> gh: if a response from http server dies, server has no way to know.
> ad: There could be a proxy in the middle, or isp's proxy server. The
> server sent it successfully to the proxy, but never made it to the
> client.
>
> gh: how is this dealt with for commits into relational dbs? same thing
> applies
> ad: don't know
> ee: could ask for everything in this region.
> ad: have a new element that says 'i used to be this'.
> bo: you do an insert in a db, to get last pk that was issued. client
> talks back to server, give me last feature uri that was provisioned on
> my connection. so the client is in control.
>
> sc: it's up to client to get confirmation from server. If it failed to
> get the response after sending in the modification request, it could
> request that the server send it again.
>
> ad: (drawing on whiteboard) two stage strategy, get a transaction state.
>
>      post "get transaction url"
>     <---------------
>     post (put?) to transaction URL
>     ------------->
>     can do multiple (if identical)
>        ---------->
>        ---------->
>     Get was successful and here's transformation info
>     <---------------
>
> ad: server can hold transformation info for some timespan in case
> client needs to re-fetch.
>
> gh: I'm more insterested in getting a server up than a client
> regarding writeback. complex parts of the client are already
> implemented (apollo).
>
> gh: locks are region based not feature based.
> ad: can't lock...
>
> gh: we can talk about how to trigger ucla locking mechanism.
> bo: did flock transactional locking the suggested in perl
> cookbook. mage document has content. server locks an id using flock,
> (for assay das).
> gh: to lock a region on the genome, lock on all ids for features in
> this region.
> bo: make a file containing all the ids that are locked. flock this
> file.
>
> ad: file locking is frought with problems. why not keep it in the
> database and let the db lock it for you. don't let perl + file system
> do it for you. there could be fs problems. nfs isn't good at that. a
> database is much more reliable.
>
> bo: I went with perl flock mechanism since you could have other
> non-database sources (though so far it's all db).
>
> [A] steve, allen send brian code tips regarding locking.
>
> gh: putting aside pushing large data chunks into the server, for
> curation it's ok if protocol is a little error prone, since the
> curator-caused errors will be much more likely/common.
>
> ad: UK folks haven't done any writeback work as far as I know.
> gh: they haven't billed us in 2 years. Tony cox is contact, ed
> griffith is main developer.
> ad: andreas and thomas are not funded by this grant or the next one.
> gh: they are already funded by other means.
>
> ad: if someone want's to change an annotation should they need to get
> a lock first or can it work like cvs? do it if it can, get lock,
> release lock in one transaction.
> ee: that's my preference.
>
> ad: if every feature has it's own id, you know if it's...
>
> ee: some servers might not have any writeback facility at
> all. conflicts will be rare.
>
> [A] ask ed/tony on whether they plan to have any writeback facility
>
> gh: ed g wanted to work on client to do writeback, don't know who
> would work on a server there.
> ad: someone else, can't remember - roy?
> gh: unless we hear back from sanger, the highest priority for ucla
> folks after updating server for v300, is working server-side
> writeback.
>
> gh: spec freeze is for the read portion. the writeback portion will
> have to change as needed.
> ad: and arithmetic? ;-)
>
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From lstein at cshl.edu  Mon Mar 20 17:32:40 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 20 Mar 2006 12:32:40 -0500
Subject: [DAS2] query language description
In-Reply-To: <da80d7e4da69801f0a7b2b210fc66595@fruitfly.org>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<c4433b247b29525254354103b60ce414@dalkescientific.com>
	<da80d7e4da69801f0a7b2b210fc66595@fruitfly.org>
Message-ID: <200603201232.41522.lstein@cshl.edu>

The current filter query language, which provides one level of ANDs and a 
nested level of ORs, satisfies our use cases. It is not clear to me what 
additional benefit we'll get from a composable query language. Note that none 
of the popular and functional genome information sources -- NCBI, UCSC, 
Ensembl or BioMart -- offer a composable query language, and there does not 
seem to be rioting on the streets!

Lincoln


On Friday 17 March 2006 19:20, chris mungall wrote:
> On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote:
> >> right now they are forced bypass the constraint language and go direct
> >> to SQL.
> >
> > In addition, we provide defined ways for a server to indicate
> > that there are additional ways to query the server.
>
> I was positing this as a bad feature, not a good one. or at least a
> symptom of an incorrectly designed system (at least in the case of the
> GO DB API - it may not carry forward to DAS - though if you're going to
> allow querying by terms...)
>
> >> None of these really lit into the DAS paradigm. I'm guessing you want
> >> something simple that can be used as easily as an API with get-by-X
> >> methods but will seamlessly blend into something more powerful. I
> >> think what you have is on the right lines. I'm just arguing to make
> >> this language composable from the outset, so that it can be extended
> >> to whatever expressivity is required in the future, without bolting on
> >> a new query system that's incompatible with the existing one.
> >
> > We have two ways to compose the system.  If the simple query language
> > is extended, for example, to support word searches of the text field
> > instead of substring searches, then a server can say
> >
> > <CAPABILITY type="features"
> > query_uri="http://somewhere.over.rainbow/server.cgi">
> >    <SUPPORTS name="word-search"/>
> > </CAPABILITY>
> >
> > This is backwards compatible, so the normal DAS queries work.  But
> > a client can recognize the new feature and support whatever new filters
> > that 'word-search' indicates, eg
> >
> >    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*
> >
> > (finds features with notes containing words starting with 'Andre' )
> >
> > These are composable.  For example, suppose Sanger allows modification
> > date searches of curation events.  Then it might say
> >
> > <CAPABILITY type="features"
> > query_uri="http://somewhere.over.rainbow/server.cgi">
> >    <SUPPORTS name="word-search"/>
> >    <SUPPORTS name="sanger-curation"/>
> > </CAPABILITY>
>
> so this is limited to single-argument search functions?
>
> > and I can search for notes containing words starting with "Andre"
> > which were modified by "dalke" between 2002 and 2005 by doing
> >
> >    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*&
> >         modified-by=dalke&modified-before=2005&modified-after=2002
>
> but the compositionality is always associative since the CGI parameter
> constraint forbids nesting
>
> > An advantage to the simple boolean logic of the current system
> > is that the GUI interface is easy, and in line with existing
> > simple search systems.
>
> there's nothing preventing you from implementing a simple GUI on top of
> an expressive system - there is nothing forcing you to use the
> expressivity
>
> > If someone wants to implement a new search system which is
> > not backwards compatible then the server can indicate that
> > alternative with a new CAPABILITY.  Suppose Thomas at Sanger
> > comes up with a new search mechanism based on an object query
> > language he invented,
> >
> > <CAPABILITY type="down-oql"
> >      query_uri="http://sanger.ac.uk/oql-search" />
> >
> > The Sanger and EBI clients might understand that and support
> > a more complex GUI, eg, with a text box interface.  Everyone
> > else must ignore unknown capability types.
>
> but this doesn't integrate with the existing query system
>
> > Then that would be POSTED (or whatever the protocol defines)
> > to the given URL, which returns back whatever results are
> > desired.
> >
> > Or the server can point to a public MySQL port, like
> >
> > <CAPABILITY type="mysql-connection"
> >      query_uri="mysql://username:password at hostname:port/databasename"
> > />
> >
> > That's what you are doing to bypass the syntax, except that
> > here it isn't a bypass; you can define the new interface in
> > the DAS sources document.
> >
> >> The generic language could just be some kind of simple
> >> extensible function syntax for search terms, boolean operators,
> >> and some kind of (optional) nesting syntax.
> >
> > Which syntax?  Is it supposed to be easy for people to write?
> > Text oriented?  Or tree structured, like XML, or SQL-like?
>
> I'd favour some concrete asbtract syntax that looks much like the
> existing DAS QL
>
> > And which clients and servers will implement that search
> > language?
>
> all servers. clients optional
>
> > If there was a generic language it would allow
> >    OR("on segment Chr1 between 1000 and 2000",
> >       "on segment ChrX between 99 and 777")
> > which is something we are expressly not allowing in DAS2
> > queries.  It doesn't make sense for the target applications
> > and by excluding it it simplifies the server development,
> > which means less chance for bugs.
>
> this example is pointless but it's easy to imagine plenty of ontology
> term queries or other queries in which this would be useful
>
> I guess I depart from the normal DAS philosophy - I don't see this
> being a high barrier for entry for servers, if they're not up to this
> it'll probably be a buggy hacky server anyway
>
> > Also, I personally haven't figured out a decent way to
> > do a GUI composition of a complex boolean query which is
> > as easy as learning the query language in the first place.
>
> doesn't mean it doesn't exist.
>
> i'm not sure what's hard about having say, a clipboard of favourite
> queries, then allowing some kind of drag-and-drop composition
>
> > A more generic language implementation is a lot of overhead
> > if most (80%? 90%) need basic searches, and many of the
> > rest can fake it by breaking a request into parts and
> > doing the boolean logic on the client side.
>
> this is always an option - if the user doesn't mind the additional
> possibly very high overhead. it's just a little bit of a depressing
> approach, as if Codd's seminal paper from 1970 or whenever it was never
> happened.
>
> > Feedback I've heard so far is that DAS1 queries were
> > acceptable, with only a few new search fields needed.
> >
> >> hmm, not sure how useful this would be - surely you'd want something
> >> more dasmodel-aware?
> >
> > The example I gave was a bad one.  What I meant was to show
> > how there's an extension point so someone can develop a new
> > search interface and clients can know that the new functionality
> > exists, without having to change the DAS spec.
>
> ok
>
> that's probably all I've got to say on the matter, sorry for being
> irksome. I guess I'm fundamentally missing something, that is, why wrap
> simple and expressive declarative query languages with limited ad-hoc
> constraint systems with consciously limited expressivity and limited
> means of extensibility..
>
> cheers
> chris
>
> > 					Andrew
> > 					dalke at dalkescientific.com
> >
> > _______________________________________________
> > DAS2 mailing list
> > DAS2 at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/das2
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From Gregg_Helt at affymetrix.com  Mon Mar 20 17:40:19 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 20 Mar 2006 09:40:19 -0800
Subject: [DAS2] call today?
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA47@msex02.affymetrix.com>

Apologies, I forgot to post that today's DAS/2 teleconference was
cancelled.  The feeling on Friday was that after the code sprint last
week we needed a break.  The teleconference will resume next week on the
regular schedule (Mondays at 9:30 AM Pacific time).

	Thanks,
	Gregg

> -----Original Message-----
> From: Andreas Prlic [mailto:ap3 at sanger.ac.uk]
> Sent: Monday, March 20, 2006 9:02 AM
> To: Andrew Dalke; Helt,Gregg
> Cc: Thomas Down
> Subject: call today?
> 
> Hi Dasians,
> 
> do we have a conference call today?
> 
> Cheers,
> Andreas
> 
>
-----------------------------------------------------------------------
> 
> Andreas Prlic      Wellcome Trust Sanger Institute
>                                Hinxton, Cambridge CB10 1SA, UK
> 			 +44 (0) 1223 49 6891


From cjm at fruitfly.org  Mon Mar 20 23:45:46 2006
From: cjm at fruitfly.org (chris mungall)
Date: Mon, 20 Mar 2006 15:45:46 -0800
Subject: [DAS2] query language description
In-Reply-To: <200603201232.41522.lstein@cshl.edu>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
	<c4433b247b29525254354103b60ce414@dalkescientific.com>
	<da80d7e4da69801f0a7b2b210fc66595@fruitfly.org>
	<200603201232.41522.lstein@cshl.edu>
Message-ID: <7900d1398d5045a268a5f6fe51af529d@fruitfly.org>


I guess things need to be left open for a DAS/3...

On Mar 20, 2006, at 9:32 AM, Lincoln Stein wrote:

> The current filter query language, which provides one level of ANDs 
> and a
> nested level of ORs, satisfies our use cases. It is not clear to me 
> what
> additional benefit we'll get from a composable query language. Note 
> that none
> of the popular and functional genome information sources -- NCBI, UCSC,
> Ensembl or BioMart -- offer a composable query language, and there 
> does not
> seem to be rioting on the streets!
>
> Lincoln
>
>
> On Friday 17 March 2006 19:20, chris mungall wrote:
>> On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote:
>>>> right now they are forced bypass the constraint language and go 
>>>> direct
>>>> to SQL.
>>>
>>> In addition, we provide defined ways for a server to indicate
>>> that there are additional ways to query the server.
>>
>> I was positing this as a bad feature, not a good one. or at least a
>> symptom of an incorrectly designed system (at least in the case of the
>> GO DB API - it may not carry forward to DAS - though if you're going 
>> to
>> allow querying by terms...)
>>
>>>> None of these really lit into the DAS paradigm. I'm guessing you 
>>>> want
>>>> something simple that can be used as easily as an API with get-by-X
>>>> methods but will seamlessly blend into something more powerful. I
>>>> think what you have is on the right lines. I'm just arguing to make
>>>> this language composable from the outset, so that it can be extended
>>>> to whatever expressivity is required in the future, without bolting 
>>>> on
>>>> a new query system that's incompatible with the existing one.
>>>
>>> We have two ways to compose the system.  If the simple query language
>>> is extended, for example, to support word searches of the text field
>>> instead of substring searches, then a server can say
>>>
>>> <CAPABILITY type="features"
>>> query_uri="http://somewhere.over.rainbow/server.cgi">
>>>    <SUPPORTS name="word-search"/>
>>> </CAPABILITY>
>>>
>>> This is backwards compatible, so the normal DAS queries work.  But
>>> a client can recognize the new feature and support whatever new 
>>> filters
>>> that 'word-search' indicates, eg
>>>
>>>    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*
>>>
>>> (finds features with notes containing words starting with 'Andre' )
>>>
>>> These are composable.  For example, suppose Sanger allows 
>>> modification
>>> date searches of curation events.  Then it might say
>>>
>>> <CAPABILITY type="features"
>>> query_uri="http://somewhere.over.rainbow/server.cgi">
>>>    <SUPPORTS name="word-search"/>
>>>    <SUPPORTS name="sanger-curation"/>
>>> </CAPABILITY>
>>
>> so this is limited to single-argument search functions?
>>
>>> and I can search for notes containing words starting with "Andre"
>>> which were modified by "dalke" between 2002 and 2005 by doing
>>>
>>>    http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*&
>>>         modified-by=dalke&modified-before=2005&modified-after=2002
>>
>> but the compositionality is always associative since the CGI parameter
>> constraint forbids nesting
>>
>>> An advantage to the simple boolean logic of the current system
>>> is that the GUI interface is easy, and in line with existing
>>> simple search systems.
>>
>> there's nothing preventing you from implementing a simple GUI on top 
>> of
>> an expressive system - there is nothing forcing you to use the
>> expressivity
>>
>>> If someone wants to implement a new search system which is
>>> not backwards compatible then the server can indicate that
>>> alternative with a new CAPABILITY.  Suppose Thomas at Sanger
>>> comes up with a new search mechanism based on an object query
>>> language he invented,
>>>
>>> <CAPABILITY type="down-oql"
>>>      query_uri="http://sanger.ac.uk/oql-search" />
>>>
>>> The Sanger and EBI clients might understand that and support
>>> a more complex GUI, eg, with a text box interface.  Everyone
>>> else must ignore unknown capability types.
>>
>> but this doesn't integrate with the existing query system
>>
>>> Then that would be POSTED (or whatever the protocol defines)
>>> to the given URL, which returns back whatever results are
>>> desired.
>>>
>>> Or the server can point to a public MySQL port, like
>>>
>>> <CAPABILITY type="mysql-connection"
>>>      query_uri="mysql://username:password at hostname:port/databasename"
>>> />
>>>
>>> That's what you are doing to bypass the syntax, except that
>>> here it isn't a bypass; you can define the new interface in
>>> the DAS sources document.
>>>
>>>> The generic language could just be some kind of simple
>>>> extensible function syntax for search terms, boolean operators,
>>>> and some kind of (optional) nesting syntax.
>>>
>>> Which syntax?  Is it supposed to be easy for people to write?
>>> Text oriented?  Or tree structured, like XML, or SQL-like?
>>
>> I'd favour some concrete asbtract syntax that looks much like the
>> existing DAS QL
>>
>>> And which clients and servers will implement that search
>>> language?
>>
>> all servers. clients optional
>>
>>> If there was a generic language it would allow
>>>    OR("on segment Chr1 between 1000 and 2000",
>>>       "on segment ChrX between 99 and 777")
>>> which is something we are expressly not allowing in DAS2
>>> queries.  It doesn't make sense for the target applications
>>> and by excluding it it simplifies the server development,
>>> which means less chance for bugs.
>>
>> this example is pointless but it's easy to imagine plenty of ontology
>> term queries or other queries in which this would be useful
>>
>> I guess I depart from the normal DAS philosophy - I don't see this
>> being a high barrier for entry for servers, if they're not up to this
>> it'll probably be a buggy hacky server anyway
>>
>>> Also, I personally haven't figured out a decent way to
>>> do a GUI composition of a complex boolean query which is
>>> as easy as learning the query language in the first place.
>>
>> doesn't mean it doesn't exist.
>>
>> i'm not sure what's hard about having say, a clipboard of favourite
>> queries, then allowing some kind of drag-and-drop composition
>>
>>> A more generic language implementation is a lot of overhead
>>> if most (80%? 90%) need basic searches, and many of the
>>> rest can fake it by breaking a request into parts and
>>> doing the boolean logic on the client side.
>>
>> this is always an option - if the user doesn't mind the additional
>> possibly very high overhead. it's just a little bit of a depressing
>> approach, as if Codd's seminal paper from 1970 or whenever it was 
>> never
>> happened.
>>
>>> Feedback I've heard so far is that DAS1 queries were
>>> acceptable, with only a few new search fields needed.
>>>
>>>> hmm, not sure how useful this would be - surely you'd want something
>>>> more dasmodel-aware?
>>>
>>> The example I gave was a bad one.  What I meant was to show
>>> how there's an extension point so someone can develop a new
>>> search interface and clients can know that the new functionality
>>> exists, without having to change the DAS spec.
>>
>> ok
>>
>> that's probably all I've got to say on the matter, sorry for being
>> irksome. I guess I'm fundamentally missing something, that is, why 
>> wrap
>> simple and expressive declarative query languages with limited ad-hoc
>> constraint systems with consciously limited expressivity and limited
>> means of extensibility..
>>
>> cheers
>> chris
>>
>>> 					Andrew
>>> 					dalke at dalkescientific.com
>>>
>>> _______________________________________________
>>> DAS2 mailing list
>>> DAS2 at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/das2
>>
>> _______________________________________________
>> DAS2 mailing list
>> DAS2 at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/das2
>
> -- 
> Lincoln D. Stein
> Cold Spring Harbor Laboratory
> 1 Bungtown Road
> Cold Spring Harbor, NY 11724
> FOR URGENT MESSAGES & SCHEDULING,
> PLEASE CONTACT MY ASSISTANT,
> SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)


From dalke at dalkescientific.com  Tue Mar 21 23:21:11 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Tue, 21 Mar 2006 15:21:11 -0800
Subject: [DAS2] complex features
Message-ID: <e0996f0489afa3170e718b5c538bcbf4@dalkescientific.com>

I've been working on the data model some, trying to get a feel
for complex features.  I've also been evaluating how GFF3 handles
them.

Both use a parent/child link, though GFF3 only has the reference
to the parent while DAS has both.  That means DAS clients can
determine when all of the complex feature have been downloaded.
GFF3 potentially requires waiting until the end of the library,
though there is a way to hint that all the results have been
returned.

Both allow complex graphs.  That is, both allow cycles.  I
assume we are restricting complex features to DAGs, but even
then the following is possible

   [root1]     [root2]     [root3]
     | \          |         /
     |  \         |        /
     |   ------------------
     |  |    node 4        |
     |   ------------------
     |  /
     | /
     |/
  [node 5]

Node 4 has three parents (root1, root2 and root3) and
node 5 has two parents (root1 and node4)

This may or may not make biological sense.  I don't know.  I
only point out that it's there.

I feel that complex annotations must only have a single root
element, even if it's a synthetic one with no location.

Next, consider writeback, with the following two complex features

  [root1]                  [root2]
   |    \                   |
   |      \                 |
   |        \               |
[node1.1]  [node1.2]     [node2.1]


Suppose someone adds a new "connector" node

                       >-->---.
                       |      V
  [root1]              |   [root2]
   |    \              |    |
   |      \            |    |
   |        \          ^    |
[node1.1]  [node1.2]  |  [node2.1]
     |                 |
     V                 |
   [connector]-->--->--^

Should that sort of thing be allowed?  What's the model
for the behavior?

It seems to me there's a missing concept in DAS relating to
complex features.  My model is that the "complex feature" is
its own concept, which  I've been calling an "annotation".
All simple features are annotations.  The connected nodes of
a complex feature are also annotations.

As such, two annotations cannot be combined like this.
Writeback only occurs at the annotation level, in that
new feature elements cannot be used to connect two existing
annotations.

We might also consider having a new interface for annotations
(complex features), so they can be referred to by URI.  I
don't think that's needed right now.


					Andrew
					dalke at dalkescientific.com


From cjm at fruitfly.org  Wed Mar 22 00:43:49 2006
From: cjm at fruitfly.org (chris mungall)
Date: Tue, 21 Mar 2006 16:43:49 -0800
Subject: [DAS2] complex features
In-Reply-To: <e0996f0489afa3170e718b5c538bcbf4@dalkescientific.com>
References: <e0996f0489afa3170e718b5c538bcbf4@dalkescientific.com>
Message-ID: <3879834dc8786f628c68e47a076c1e90@fruitfly.org>


The GFF3 spec says that Parent can only be used to indicate part_of 
relations. If we go by the definition of part_of in the OBO relations 
ontology, or any other definition of part_of (there are many), then 
cycles are explicitly verboten, although the GFF3 docs do not state 
this.

There's no reason in general why part_of graphs should have a single 
root, although it's certainly desirable from a software perspective. 
Dicistronic genes thow a bit of a spanner in the works. There's nothing 
to stop you adding a fake root, or refering to the maximally connected 
graph as an entity in its own right however.

I don't know enough about DAS/2 to be helpful with the writeback 
example. It looks like your example below is a gene merge.

On Mar 21, 2006, at 3:21 PM, Andrew Dalke wrote:

> I've been working on the data model some, trying to get a feel
> for complex features.  I've also been evaluating how GFF3 handles
> them.
>
> Both use a parent/child link, though GFF3 only has the reference
> to the parent while DAS has both.  That means DAS clients can
> determine when all of the complex feature have been downloaded.
> GFF3 potentially requires waiting until the end of the library,
> though there is a way to hint that all the results have been
> returned.
>
> Both allow complex graphs.  That is, both allow cycles.  I
> assume we are restricting complex features to DAGs, but even
> then the following is possible
>
>    [root1]     [root2]     [root3]
>      | \          |         /
>      |  \         |        /
>      |   ------------------
>      |  |    node 4        |
>      |   ------------------
>      |  /
>      | /
>      |/
>   [node 5]
>
> Node 4 has three parents (root1, root2 and root3) and
> node 5 has two parents (root1 and node4)
>
> This may or may not make biological sense.  I don't know.  I
> only point out that it's there.
>
> I feel that complex annotations must only have a single root
> element, even if it's a synthetic one with no location.
>
> Next, consider writeback, with the following two complex features
>
>   [root1]                  [root2]
>    |    \                   |
>    |      \                 |
>    |        \               |
> [node1.1]  [node1.2]     [node2.1]
>
>
> Suppose someone adds a new "connector" node
>
>> -->---.
>                        |      V
>   [root1]              |   [root2]
>    |    \              |    |
>    |      \            |    |
>    |        \          ^    |
> [node1.1]  [node1.2]  |  [node2.1]
>      |                 |
>      V                 |
>    [connector]-->--->--^
>
> Should that sort of thing be allowed?  What's the model
> for the behavior?
>
> It seems to me there's a missing concept in DAS relating to
> complex features.  My model is that the "complex feature" is
> its own concept, which  I've been calling an "annotation".
> All simple features are annotations.  The connected nodes of
> a complex feature are also annotations.
>
> As such, two annotations cannot be combined like this.
> Writeback only occurs at the annotation level, in that
> new feature elements cannot be used to connect two existing
> annotations.
>
> We might also consider having a new interface for annotations
> (complex features), so they can be referred to by URI.  I
> don't think that's needed right now.
>
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From boconnor at ucla.edu  Wed Mar 22 00:47:51 2006
From: boconnor at ucla.edu (Brian O'Connor)
Date: Tue, 21 Mar 2006 16:47:51 -0800
Subject: [DAS2] das.biopackages.net
Message-ID: <44209EB7.9070008@ucla.edu>

The DAS/2 server located at das.biopackages.net may be unavailable on 
and off for the next hour or so.  Just wanted to let everyone know in 
case someone is using it.

--Brian


From dalke at dalkescientific.com  Thu Mar 23 21:44:00 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 23 Mar 2006 13:44:00 -0800
Subject: [DAS2] complex features
In-Reply-To: <3879834dc8786f628c68e47a076c1e90@fruitfly.org>
References: <e0996f0489afa3170e718b5c538bcbf4@dalkescientific.com>
	<3879834dc8786f628c68e47a076c1e90@fruitfly.org>
Message-ID: <53840452abca7236130efd4e57f42aef@dalkescientific.com>

chris:
> The GFF3 spec says that Parent can only be used to indicate part_of 
> relations. If we go by the definition of part_of in the OBO relations 
> ontology, or any other definition of part_of (there are many), then 
> cycles are explicitly verboten, although the GFF3 docs do not state 
> this.

It looks like the most recent spec at
   http://song.sourceforge.net/gff3.shtml
does state this, although the earlier one did not:

   "A Parent relationship between two features that is not one of the
    Part-Of relationships listed in SO should trigger a parse exception
    Similarly, a set of Parent relationships that would cause a cycle
    should also trigger an exception."


> There's no reason in general why part_of graphs should have a single
> root, although it's certainly desirable from a software perspective.
> Dicistronic genes thow a bit of a spanner in the works. There's nothing
> to stop you adding a fake root, or refering to the maximally connected
> graph as an entity in its own right however.

I've been working with GFF3 data for a few days now, trying to
catch the different cases.  It isn't hard, but it had been a long
time since I worried about cycle detection.

The biggest problem has been keeping all the "could be a parent"
elements around until the entire data set is finished.  Except
for features with no ID and no Parents, parsers need to go to
the end of the file (or no-forward-references line) before
being able to do anything with the data.

In DAS it's easier because each feature lists all parents and
children, so it's possible to detect when a complex feature is
ready.  Even then it requires a bit of thinking to handle cases
with multiple roots.  It would be much easier if either all
complex features were in an element

   <COMPLEX-FEATURE>
    <FEATURE id="1" />
    <FEATURE id="2" />
   </COMPLEX-FEATURE>

or if there was a unique name to tie them together

    <FEATURE id="1" complex-feature-id="A"/>
    <FEATURE id="2" complex-feature-id="A"/>

Another solution is to make the problem simpler.  I see, for
example, that the biopython doesn't have any gff code and
the biojava one only works at the single feature level.  Only
bioperl implements a gff3 parser with support for complex features,
but it assumes all complex features are single rooted and that
the features are topologically sorted, so that parents come
before children.  It also looks like a diamond structure (single
root, two children, both with the same child) is supported on
input but the output assumes features are trees.

For example, I tried it just now on dmel-4-r4.3.gff from wormbase,
which I'm finding to be a bad example of what a GFF file should
look like.  It contains one duplicate ID, which bioperl catches
and dies on.  I fixed it. It then complains with a lot of

    MSG: Bio::SeqFeature::Annotated=HASH(0xba4a93c) is not contained
    within parent feature, and expansion is not valid, ignoring.

because the features are not topologically sorted, as in this
(trimmed) example.  The order is the same as in the file.

4  sim4:na_dbEST.same.dmel match_part  5175  5627 ...
                        Parent=88682278868229;Name=GH01459.5prime
4  sim4:na_dbEST.same.dmel match   5175    5627 ...
                       ID=88682278868229;Name=GH

The simpler the data model we use (eg, single rooted, output
must be topologically sorted with parents first) then the
more likely it is for client and server code to be correct and
the more likely there will be more DAS code.


					Andrew
					dalke at dalkescientific.com


From ap3 at sanger.ac.uk  Fri Mar 24 18:19:41 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 24 Mar 2006 18:19:41 +0000
Subject: [DAS2] 100th das1 source in registry
Message-ID: <23fe2aa8d3c4a9afc28782b3d3e58032@sanger.ac.uk>

Hi!

Today the 100th DAS1 source was registered in the DAS registration 
server at

http://das.sanger.ac.uk/registry/

It currently counts 101 DAS sources from 23 institutions in 9 countries.

The purpose of the DAS registration service is to keep track which DAS 
services are available
and to help with automated discovery of new DAS servers  on the client 
side.

Regards,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From Gregg_Helt at affymetrix.com  Fri Mar 24 18:37:21 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Fri, 24 Mar 2006 10:37:21 -0800
Subject: [DAS2] 100th das1 source in registry
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA50@msex02.affymetrix.com>

Congratulations!

On a related note, is there a way to automatically register DAS/2
servers yet?  If not, can I send you info to add the Affymetrix test
DAS/2 server to the registry?

	Thanks,
	Gregg

> -----Original Message-----
> From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-
> bio.org] On Behalf Of Andreas Prlic
> Sent: Friday, March 24, 2006 10:20 AM
> To: DAS/2
> Subject: [DAS2] 100th das1 source in registry
> 
> Hi!
> 
> Today the 100th DAS1 source was registered in the DAS registration
> server at
> 
> http://das.sanger.ac.uk/registry/
> 
> It currently counts 101 DAS sources from 23 institutions in 9
countries.
> 
> The purpose of the DAS registration service is to keep track which DAS
> services are available
> and to help with automated discovery of new DAS servers  on the client
> side.
> 
> Regards,
> Andreas
> 
> 
>
-----------------------------------------------------------------------
> 
> Andreas Prlic      Wellcome Trust Sanger Institute
>                                Hinxton, Cambridge CB10 1SA, UK
> 			 +44 (0) 1223 49 6891
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2


From ap3 at sanger.ac.uk  Sat Mar 25 11:13:06 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sat, 25 Mar 2006 11:13:06 +0000
Subject: [DAS2] 100th das1 source in registry
In-Reply-To: <C71929195D04BF48BAECD499AF717B480198CA50@msex02.affymetrix.com>
References: <C71929195D04BF48BAECD499AF717B480198CA50@msex02.affymetrix.com>
Message-ID: <e34b6e6ad566f659e674f5d0f1dc176f@sanger.ac.uk>

> On a related note, is there a way to automatically register DAS/2
> servers yet?

A beta - version can be tried  at the toy-registry at
http://www.spice-3d.org/dasregistry/registerDas2Source.jsp

and the results will be visible at
http://www.spice-3d.org/dasregistry/das2/sources

- so far this provides a simple upload mechanism that is based on the 
sources decription.
what is still missing is a validation of the user provided data ("does 
this request give really a features response?")
plus other things like a html representation of the das2 servers.

  I think it would be great if Andrew's Dasypus server could provide an 
interface to the validation
mechanism that could be used by programs. If validation fails the 
response could contain
a link, to point the user to the nice error report web page.

will be abroad next week so can't join for the call...

Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891


From Gregg_Helt at affymetrix.com  Mon Mar 27 16:24:53 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 27 Mar 2006 08:24:53 -0800
Subject: [DAS2] Agenda for today's teleconference
Message-ID: <C71929195D04BF48BAECD499AF717B480198CA53@msex02.affymetrix.com>

We're back on the standard DAS/2 teleconference schedule, every Monday
at 9:30 AM Pacific time.
 
Suggestions for today's agenda:
Code sprint summary
DAS/2 grant status
Writeback spec & implementation
???
 
Teleconference # US:   800-531-3250
            International:   303-928-2693
Conference ID: 2879055
Passcode: 1365
 

From Steve_Chervitz at affymetrix.com  Mon Mar 27 19:05:28 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Mon, 27 Mar 2006 11:05:28 -0800
Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 27 Mar 2006
Message-ID: <C04D7778.1D4FC%Steve_Chervitz@affymetrix.com>

Notes from the weekly DAS/2 teleconference, 27 Mar 2006

$Id: das2-teleconf-2006-03-27.txt,v 1.1 2006/03/27 19:03:30 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Gregg Helt
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke
  UC Berkeley: Nomi Harris
  UCLA: Allen Day 
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


Proposed agenda:
 * Code sprint summary
 * DAS/2 grant status
 * Writeback spec & implementation


[Notetaker: missed the first 40min - apologies]


Topic: Code sprint summary
--------------------------

gh: pleased with our progress during the last code sprint (13-17 Mar)

[Notetaker: detailed summaries of what folks did during this code sprint
are described here:
http://lists.open-bio.org/pipermail/das2/2006-March/000668.html ]


Topic: Writeback 
----------------

[Discussion in progress]

ls: in my model, every feature has a unique id, when you update it,
it's going to make the change to the object and not create a new one.
the object is associated with url in some way, when you update the
position of this exon, it's going to change some attributes of it.

gh: thomas proposed the alternative: every time you change a feature
you create a new one with a pointer back to the old one.

ad: can't speak for what db implementers will do for versioning of
features. only taking about merging from different complex
features. So only when you merge from complex ones.

ls: this is the history tracking business. writeback will explicitly
support merges and splits.
ad: how detailed does the spec need to be?
ls: driven by requirements.
ad: what are the reqts? I can't go further without more details. roy
said eevery modification gets new version, so you could do time
travel, if your db supported that.

ls: does igb or apollo explicitly support merges and splits among
transcripts?
gh: yes. curation in igb is experimental (now turned off). but it does
support these. as does apollo. so these are essential.
ls: writeback should have instructions for how feature will adopt
children of a subfeature. one feature adopts children of the other and
previous feature is now deprecated. there's a specific set
of operations for creating new features, renaming, spliting, and merging.
perhaps Nomi should write down what operations that apollo supports.

nh: yes, all those are supported as well as things like adjusting
endpoints of start of translation.
apollo can merge transcripts within a gene and between genes
(which offers to merge the associated genes). curators can do
'splurge' - a split, merge combo.
ls: that sounds like suzi's nomenclature.

gh: the db that apollo writes back to, do changes create new versions
of feature or change the feature itself?
nh: not sure. mark did the work with chado. I know they were doing
something to rewrite the entire feature if anything changed.

[A] nomi will ask Mark to join in discussion next week (3 April).

aday: what fraction of the operations are doing simple vs complex
things? eg., revising the gene model.
nh: revision happens a lot. mostly adjusting endpoints. splits and
merges are infrequent. adding annotation. But it doesn't matter how
infrequent the operations are, we either support them or we don't.

ad: when there are changes in the model, how does the client get
notified that the change occurred?
nh: that's tricky.
gh: this is outside the scope of the das/2 spec itself. as long as we
have locks to prevent simultaneous modification, that is
sufficient.

ad: there's no mechanism for polling server.
gh: yes, just requery server.
gh: but your client doesn't do it.
gh: I'm thinking of adding polling to get the last modified stuff.
For now, one can simply re-start your session to see what has changed.

aday: is the portion of writeback spec for modifying endpoints, simple
add/delete of annotations stable?
ad: the general idea is unchanged.

gh: priority here is before next meeting: brian and allen read over
writeback spec and identify any issues as implementers.
aday: looking for an 80% solution. not dealing with heritance wihich
is difficult. 
nh: splits and merges can be done with combos of simpler ops.

aday: performace operations will be affected. graph flattening and
partial indexes. splits and merges will affect this table, so will
have to trigger update of that table any time there's a
split/merge. this will have big impact on query performance: could be
1-2 sec for yeast, 30-60 min for human.

gh: what about if you do that update 1x/day? Then users would be
working off a snapshot that was current as of the end of previous
day. 
aday: caching on server responses will also be affected, unless we
turn caching off. maybe I can tell apache to remove a subset of cached
pages and leave others intact.

aday: for tiling requests - server could find affected blocks and
purge those, instead of purging the entire cache.
gh: you can't rely on any client to use your tiling strategy. but
could be helpful for those clients that use it.
aday: basically we'll have to turn caching off when we start doing
writeback.
gh: is there a way for server to detect what has changed?
gh: if database detects change it can flush cache for that sequence.
aday: maybe. possibly the easiest way to do this is via tiling.

gh: say you have two servers:
   1) everthing that can be edited
   2) everything that has been edited (slower)
aday: main server has all features and second server handles
writeback, just writes to gff file, then cron runs once a night to
merge the gff into the db.

gh: separate dbs: 1) curation  2) everything that has been edited.
aday: yes. persistent flat file adapter can be used for one of them.
gh: this is the sort of detail I'm looking for w/r/t development of
the writeback spec.

[A] allen and brian look over writeback spec to discuss on 3 April.


From nomi at fruitfly.org  Mon Mar 27 19:42:59 2006
From: nomi at fruitfly.org (Nomi Harris)
Date: Mon, 27 Mar 2006 11:42:59 -0800
Subject: [DAS2] Mark Gibson on Apollo writeback to Chado
Message-ID: <ryi64lzoerg.fsf@spongecake.lbl.gov>

mark gibson said that he plans to attend next monday's DAS/2
teleconference.  he also gave me permission to forward this message that
he wrote recently in response to a group that is adapting apollo and
wondered what he thought about direct-to-chado writeback vs. the use of
chadoxml as an intermediate storage format.  FlyBase Harvard prefers to
use the latter approach because (we gather) they worry about possibly
corrupting the database by having clients write directly to it.  if
anyone from harvard is reading this and feels that mark has
misrepresented their approach, please set us straight!

               Nomi

On 10 March 2006, Mark Gibson wrote:
 > Im rather biased as a I wrote the chado jdbc adapter [for Apollo], but let me put forth my 
 > view of chado jdbc vs chado xml.
 > 
 > The chado Jdbc adapter is transactional, the chado xml adapter is not. What this 
 > means is jdbc only makes changes in the database that reflect what has actually 
 > been changed in the apollo session, like updating a row in a table; with chado 
 > xml you just get the whole dump. So if a synonym has been added jdbc will add a 
 > row to the synonym table. For xml you will get the whole dump of the region you 
 > were editing (probably a gene) no matter how small the edit.
 > 
 > What I believe Harvard/Flybase then does (with chado xml) is wipe out the gene 
 > from the database and reinsert the gene from the chado xml. The problem with 
 > this approach is if you have data in the db thats not associated with apollo 
 > (for flybase this would be phenotype data) then that will get wiped out as well, 
 > and there has to be some way of reinstating non-apollo data. If you dont have 
 > non-apollo data and dont intend on having it in the future this isnt a huge 
 > issue I suppose. I think Harvard is integrating non-apollo data into their chado 
 > database.
 > 
 > I think what they are going to do is actually figure out all of the transactions 
 > by comparing the chado xml with the chado database, which is what apollo already 
 > does, but I'm not sure as Im not so in touch with them these days (as Im not 
 > working with apollo these days - waiting for new grant to kick in).
 > 
 > Since the paradigm with chado xml is wipe out & reload, then apollo has to make 
 > sure it preserves every bit of the chado xml that came in. Theres a bunch of 
 > stuff thats in chado/chado xml that the apollo datamodel is unconcerned with, 
 > and has no need to be concerned with as its stuff that it doesnt visualize. In 
 > other words apollos data model is solely for apollos task of visualizing data, 
 > not for roundtripping what we call non-apollo data. In writing the chado xml 
 > adapter for FlyBase, Nomi Harris had a heck of a time with these issues, and she 
 > can elaborate on this I suppose.
 > 
 > I'm personally not fond of chado xml because its basically a relational database 
 > dump, so its extremely verbose. It redundantly has information for lots of joins 
 > to data in other tables - like a cvterm entry can take 10 or 20 lines of chado 
 > xml, and a given cvterm may be used a zillion times in a given chado xml file 
 > (as every feature has a cvterm). So these files can get rather large.
 > 
 > The solution for this verbose output is to use what I call macros in chado xml. 
 > Macros are supported by xort. They take the 15 line cvterm entry and reduce it 
 > to a line or 2 making the file size much more reasonable. The apollo chado xml 
 > adapter does not support macros, so you have to use unmacro'd chado xml for 
 > apollo purposes. Nomi Harris had a hard enough time getting the chado xml 
 > adapter working for flybase(and did a great job with a harrowing task), that she 
 > did not have time to take on the macro issue. If you wanted macros (and smaller 
 > file sizes) you would have to add this functionality to the chado xml adapter 
 > (are there java programmers in your group?).
 > 
 > One of the arguments against the jdbc adapter is that its dangerous because it 
 > goes straight into the database so if there are any bugs in the data adapter 
 > then the database could get corrupted - some groups find this a bit precarious. 
 > This is a valid argument. I think theres 2 solutions here. One is to thoroughly 
 > test the adapter out against a test database until you are confident that bugs 
 > are hammered out.
 > 
 > Another solution is to not go straight from apollo to the database. You can use 
 > an interim format and actually use apollo to get that interim format into the 
 > database. Of course one choice for interim format is chado xml and then you are 
 > at the the chado xml solution. The other choice for file format is GAME xml. You 
 > can then use apollo to load game into the chado database, and this can be done 
 > at the command line (with batching) so you dont have to bring up the gui to do 
 > it. Also chado xml can be loaded into chado via apollo as well (of course xort 
 > does this as well but not with transactions)
 > 
 > So then the question is if Im not going to go straight into the database, why 
 > would I choose game over chado xml?  Or if Im using chado xml should I use 
 > apollo or xort to load into chado. I think if you are using chado xml it makes 
 > sense to use xort as it is the tried & true technology for chado xml. The 
 > advantage of going through apollo is that it also uses the transactions from 
 > apollo (theres a transaction xml file) and thus writes back the edits in a 
 > transactional way as mentioned above rather than in a wipe out & reload fashion.
 > 
 > Also Game is a tried & true technology that has been used with apollo in 
 > production at flybase (before chado came along) for many years now. One 
 > criticism of it has been that DTD/XSD/schema has been a moving target, nor has 
 > it been described. That is not as true anymore. Nomi Harris has made a xsd for 
 > it as well as a rng. But I must confess that I have recently added the ability 
 > to have one level annotations in game (previously 1 levels had to be hacked as 3 
 > levels). Also game is a lot less verbose than un-macro'd chado xml, as it more 
 > or less fits with the apollo datamodel. One advantage of chado xml over game xml 
 > is that it is more flexible in terms of taking on features of arbitrary depth.
 > 
 > The chado xml adapter was developed for FlyBase and as far as I know has not 
 > been taken on by any other groups yet. Nomi can elaborate on this, but I think 
 > what this might mean is that there are places where things are FlyBase specific. 
 > If you went with chado xml the adapter would have to be generalized. Its a good 
 > exercise for the adapter to go through, but it will take a bit of work. Nomi can 
 > probably comment on how hard generalizing might be. I could be wrong about this 
 > but I think the current status with the chado xml adapter is that Harvard has 
 > done a bunch of testing on it but they havent put it into production yet.
 > 
 > The jdbc adapter is being used by several groups so has been forced to be 
 > generalized. One thing I have found is that chado databases vary all too much 
 > from mod to mod (ontologies change). There is a configuration file for the jdbc 
 > adapter that has settings for the differences that I encountered. I initially 
 > wrote it for cold spring harbors rice database that will be used in classrooms. 
 > Its working for rice in theory, but they havent actually used it much in the 
 > classroom yet. For rice the model is to save to game and use apollo command line 
 > to save game & transactions back to chado.
 > 
 > Cyril Pommier, at the INRA - URGI - Bioinformatique, has taken on the jdbc 
 > adapter for his group. I have cc'd him on this email as I think he will have a 
 > lot to say about the jdbc adapter. Cyril has uncovered many bugs and has fixed a 
 > lot of them (thank you cyril) as hes a very savvy java programmer. And he has 
 > also forced the adapter to generalize and brought about the evolution of the 
 > config file to adapt to chado differences. But as Cyril can attest (Cyril feel 
 > free to elaborate) it has been a lot of work to get jdbc working for him. There 
 > were a lot of bugs to fix that we both went after. Hopefully now its a bit more 
 > stable and the next db/mod wont have as many problems. I think Cyril is still at 
 > the test phase and hasn't gone into production (Cyril?)
 > 
 > Berkeley is using the jdbc adapter for an in house project. They are using the 
 > jdbc reader to load up game files (as the straight jdbc reader is slow as the 
 > chado db is rather slow) which are then loaded by a curator. They are saving 
 > game, and then I think chris mungall is xslting game to chado xml which is then 
 > saved with xort - or he is somehow writing game in another way - not actually 
 > sure. The Berkeley group drove the need for 1 level annotations(in jdbc,game,& 
 > apollo datmodel)
 > 
 > Jonathan Crabtree at TIGR wrote the jdbc read adapter, and they use it there. I 
 > believe they are intending to use the write adapter but dont yet do so (Jonathan?).
 > 
 > I should mention that reading jdbc straight from chado tends to be slow, as I 
 > find that chado is a slow database, at least for Berkeley. It really depends on 
 > the db vendor and the amount of data. TIGRs reading is actually really zippy. 
 > The workaround for slow chados is to dump game files that read in pretty fast.
 > 
 > In all fairness, you should probably email with FlyBase (& Chris Mungall) and 
 > get the pros of using chado xml & xort, which they can give a far better answer 
 > on than I.
 > 
 > Hope this helps,
 > Mark


From dalke at dalkescientific.com  Mon Mar 27 20:59:28 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 27 Mar 2006 13:59:28 -0700
Subject: [DAS2] cell phone battery dead
Message-ID: <3d9298aced5c4efb7d9c34574fcf7618@dalkescientific.com>

Sorry about the drop out towards the end of today's conversation.
The battery on my phone died.

					Andrew
					dalke at dalkescientific.com