From dalke at dalkescientific.com Mon Apr 3 03:20:59 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 3 Apr 2006 01:20:59 -0600 Subject: [DAS2] daylight saving time Message-ID: <366941fb271add552809d50a50ab2027@dalkescientific.com> For the non-US people involved in the next phone conference call, the US just changed to daylight saving time so California is now 7 hours behind GMT instead of 8. I think the UK switched a week earlier than the US which is why people there couldn't make it last week? Andrew dalke at dalkescientific.com From lstein at cshl.edu Mon Apr 3 12:53:17 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 3 Apr 2006 12:53:17 -0400 Subject: [DAS2] daylight saving time In-Reply-To: <366941fb271add552809d50a50ab2027@dalkescientific.com> References: <366941fb271add552809d50a50ab2027@dalkescientific.com> Message-ID: <200604031253.17513.lstein@cshl.edu> Hi Guys, I'm stuck on another conf call right now. I'll be joining in 10 min. Lincoln On Monday 03 April 2006 03:20, Andrew Dalke wrote: > For the non-US people involved in the next phone conference call, > the US just changed to daylight saving time so California is now > 7 hours behind GMT instead of 8. I think the UK switched a week > earlier than the US which is why people there couldn't make it > last week? > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 -- Lincoln Stein lstein at cshl.edu Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From mgibson at bdgp.lbl.gov Mon Apr 3 12:29:55 2006 From: mgibson at bdgp.lbl.gov (mark gibson) Date: Mon, 3 Apr 2006 12:29:55 -0400 Subject: [DAS2] Mark Gibson on Apollo writeback to Chado In-Reply-To: References: Message-ID: Ive attached a powerpoint presentation that is probably easier to glance at than reading through this whole email. The first half of it is about apollo transactions. Mark -------------- next part -------------- A non-text attachment was scrubbed... Name: gmod-sri-13.ppt Type: application/vnd.ms-powerpoint Size: 599552 bytes Desc: not available URL: -------------- next part -------------- On Mar 27, 2006, at 2:42 PM, Nomi Harris wrote: > mark gibson said that he plans to attend next monday's DAS/2 > teleconference. he also gave me permission to forward this message > that > he wrote recently in response to a group that is adapting apollo and > wondered what he thought about direct-to-chado writeback vs. the > use of > chadoxml as an intermediate storage format. FlyBase Harvard > prefers to > use the latter approach because (we gather) they worry about possibly > corrupting the database by having clients write directly to it. if > anyone from harvard is reading this and feels that mark has > misrepresented their approach, please set us straight! > > Nomi > > On 10 March 2006, Mark Gibson wrote: >> Im rather biased as a I wrote the chado jdbc adapter [for Apollo], >> but let me put forth my >> view of chado jdbc vs chado xml. >> >> The chado Jdbc adapter is transactional, the chado xml adapter is >> not. 
What this >> means is jdbc only makes changes in the database that reflect what >> has actually >> been changed in the apollo session, like updating a row in a >> table; with chado >> xml you just get the whole dump. So if a synonym has been added >> jdbc will add a >> row to the synonym table. For xml you will get the whole dump of >> the region you >> were editing (probably a gene) no matter how small the edit. >> >> What I believe Harvard/Flybase then does (with chado xml) is wipe >> out the gene >> from the database and reinsert the gene from the chado xml. The >> problem with >> this approach is if you have data in the db thats not associated >> with apollo >> (for flybase this would be phenotype data) then that will get >> wiped out as well, >> and there has to be some way of reinstating non-apollo data. If >> you dont have >> non-apollo data and dont intend on having it in the future this >> isnt a huge >> issue I suppose. I think Harvard is integrating non-apollo data >> into their chado >> database. >> >> I think what they are going to do is actually figure out all of >> the transactions >> by comparing the chado xml with the chado database, which is what >> apollo already >> does, but I'm not sure as Im not so in touch with them these days >> (as Im not >> working with apollo these days - waiting for new grant to kick in). >> >> Since the paradigm with chado xml is wipe out & reload, then >> apollo has to make >> sure it preserves every bit of the chado xml that came in. Theres >> a bunch of >> stuff thats in chado/chado xml that the apollo datamodel is >> unconcerned with, >> and has no need to be concerned with as its stuff that it doesnt >> visualize. In >> other words apollos data model is solely for apollos task of >> visualizing data, >> not for roundtripping what we call non-apollo data. In writing the >> chado xml >> adapter for FlyBase, Nomi Harris had a heck of a time with these >> issues, and she >> can elaborate on this I suppose. >> >> I'm personally not fond of chado xml because its basically a >> relational database >> dump, so its extremely verbose. It redundantly has information for >> lots of joins >> to data in other tables - like a cvterm entry can take 10 or 20 >> lines of chado >> xml, and a given cvterm may be used a zillion times in a given >> chado xml file >> (as every feature has a cvterm). So these files can get rather large. >> >> The solution for this verbose output is to use what I call macros >> in chado xml. >> Macros are supported by xort. They take the 15 line cvterm entry >> and reduce it >> to a line or 2 making the file size much more reasonable. The >> apollo chado xml >> adapter does not support macros, so you have to use unmacro'd >> chado xml for >> apollo purposes. Nomi Harris had a hard enough time getting the >> chado xml >> adapter working for flybase(and did a great job with a harrowing >> task), that she >> did not have time to take on the macro issue. If you wanted macros >> (and smaller >> file sizes) you would have to add this functionality to the chado >> xml adapter >> (are there java programmers in your group?). >> >> One of the arguments against the jdbc adapter is that its >> dangerous because it >> goes straight into the database so if there are any bugs in the >> data adapter >> then the database could get corrupted - some groups find this a >> bit precarious. >> This is a valid argument. I think theres 2 solutions here. 
One is >> to thoroughly >> test the adapter out against a test database until you are >> confident that bugs >> are hammered out. >> >> Another solution is to not go straight from apollo to the >> database. You can use >> an interim format and actually use apollo to get that interim >> format into the >> database. Of course one choice for interim format is chado xml and >> then you are >> at the the chado xml solution. The other choice for file format is >> GAME xml. You >> can then use apollo to load game into the chado database, and this >> can be done >> at the command line (with batching) so you dont have to bring up >> the gui to do >> it. Also chado xml can be loaded into chado via apollo as well (of >> course xort >> does this as well but not with transactions) >> >> So then the question is if Im not going to go straight into the >> database, why >> would I choose game over chado xml? Or if Im using chado xml >> should I use >> apollo or xort to load into chado. I think if you are using chado >> xml it makes >> sense to use xort as it is the tried & true technology for chado >> xml. The >> advantage of going through apollo is that it also uses the >> transactions from >> apollo (theres a transaction xml file) and thus writes back the >> edits in a >> transactional way as mentioned above rather than in a wipe out & >> reload fashion. >> >> Also Game is a tried & true technology that has been used with >> apollo in >> production at flybase (before chado came along) for many years >> now. One >> criticism of it has been that DTD/XSD/schema has been a moving >> target, nor has >> it been described. That is not as true anymore. Nomi Harris has >> made a xsd for >> it as well as a rng. But I must confess that I have recently added >> the ability >> to have one level annotations in game (previously 1 levels had to >> be hacked as 3 >> levels). Also game is a lot less verbose than un-macro'd chado >> xml, as it more >> or less fits with the apollo datamodel. One advantage of chado xml >> over game xml >> is that it is more flexible in terms of taking on features of >> arbitrary depth. >> >> The chado xml adapter was developed for FlyBase and as far as I >> know has not >> been taken on by any other groups yet. Nomi can elaborate on this, >> but I think >> what this might mean is that there are places where things are >> FlyBase specific. >> If you went with chado xml the adapter would have to be >> generalized. Its a good >> exercise for the adapter to go through, but it will take a bit of >> work. Nomi can >> probably comment on how hard generalizing might be. I could be >> wrong about this >> but I think the current status with the chado xml adapter is that >> Harvard has >> done a bunch of testing on it but they havent put it into >> production yet. >> >> The jdbc adapter is being used by several groups so has been >> forced to be >> generalized. One thing I have found is that chado databases vary >> all too much >> from mod to mod (ontologies change). There is a configuration file >> for the jdbc >> adapter that has settings for the differences that I encountered. >> I initially >> wrote it for cold spring harbors rice database that will be used >> in classrooms. >> Its working for rice in theory, but they havent actually used it >> much in the >> classroom yet. For rice the model is to save to game and use >> apollo command line >> to save game & transactions back to chado. >> >> Cyril Pommier, at the INRA - URGI - Bioinformatique, has taken on >> the jdbc >> adapter for his group. 
I have cc'd him on this email as I think he >> will have a >> lot to say about the jdbc adapter. Cyril has uncovered many bugs >> and has fixed a >> lot of them (thank you cyril) as hes a very savvy java programmer. >> And he has >> also forced the adapter to generalize and brought about the >> evolution of the >> config file to adapt to chado differences. But as Cyril can attest >> (Cyril feel >> free to elaborate) it has been a lot of work to get jdbc working >> for him. There >> were a lot of bugs to fix that we both went after. Hopefully now >> its a bit more >> stable and the next db/mod wont have as many problems. I think >> Cyril is still at >> the test phase and hasn't gone into production (Cyril?) >> >> Berkeley is using the jdbc adapter for an in house project. They >> are using the >> jdbc reader to load up game files (as the straight jdbc reader is >> slow as the >> chado db is rather slow) which are then loaded by a curator. They >> are saving >> game, and then I think chris mungall is xslting game to chado xml >> which is then >> saved with xort - or he is somehow writing game in another way - >> not actually >> sure. The Berkeley group drove the need for 1 level annotations(in >> jdbc,game,& >> apollo datmodel) >> >> Jonathan Crabtree at TIGR wrote the jdbc read adapter, and they >> use it there. I >> believe they are intending to use the write adapter but dont yet >> do so (Jonathan?). >> >> I should mention that reading jdbc straight from chado tends to be >> slow, as I >> find that chado is a slow database, at least for Berkeley. It >> really depends on >> the db vendor and the amount of data. TIGRs reading is actually >> really zippy. >> The workaround for slow chados is to dump game files that read in >> pretty fast. >> >> In all fairness, you should probably email with FlyBase (& Chris >> Mungall) and >> get the pros of using chado xml & xort, which they can give a far >> better answer >> on than I. >> >> Hope this helps, >> Mark > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From lstein at cshl.edu Thu Apr 6 16:08:30 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Thu, 6 Apr 2006 16:08:30 -0400 Subject: [DAS2] Global IDs for worm Message-ID: <200604061608.32914.lstein@cshl.edu> I've created a directory in the das CVS under das2/GlobalSeqIDs/ to hold text files describing sequence IDs for common organisms. Currently I've created one for Worm. My schedule for the others is: Drosophilids Yeast Human Mouse Drosophila is the difficult one because there are many partial sequences. I may just do melanogaster for now. Lincoln -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From dalke at dalkescientific.com Mon Apr 10 00:24:24 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sun, 9 Apr 2006 22:24:24 -0600 Subject: [DAS2] was ill Message-ID: <0436e7cb5802c65cbce1a757a2a31b2f@dalkescientific.com> Hi all, The reason you haven't heard from me in the last week is I was quite ill with an upper respiratory virus, which you heard a bit of in last week's phone conference. I was barely able to read a paragraph at a time, much less write anything coherent. It broke yesterday afternoon and I'm able to work now. 
Strangest part was on Friday night when I dreamed about parsing RSS feeds and every time I tried to get element [0] I would wake up coughing. That's some virus! Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Apr 10 13:19:23 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 10 Apr 2006 10:19:23 -0700 Subject: [DAS2] Problem with DAS/2 registry? Message-ID: I've been trying to reach the DAS/2 registry at: http://www.spice-3d.org/dasregistry/das2/sources which used to work, but now I'm getting this error message: Proxy Error The proxy server received an invalid response from an upstream server. The proxy server could not handle the request GET?/dasregistry/das2/sources. Reason: Could not connect to remote machine: Connection refused Apache/1.3.33 Server at www.spice-3d.org Port 80 Any idea what the problem is? Thanks, Gregg From dalke at dalkescientific.com Fri Apr 14 04:29:46 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 14 Apr 2006 02:29:46 -0600 Subject: [DAS2] alignments Message-ID: <5dd5ce9d6d6e977e56c7b4e30e622f7c@dalkescientific.com> I need a bit of help here. I'm trying to hand-write an example of a feature based on an alignment. Let's assume these are annotations on fly and it's aligned to human. There's a hit from fly chromosome 4 http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4 range 100:200 to human chromosome 8 http://www.ensembl.org/Homo_sapiens/Chr1 range 200:300 Assume the CIGAR string of the match is 51 identical, 3 insertions, 24 identical, 3 deletions, 25 identical Here's the best I can manage: First question: Where do I put the object to which the alignment aligns? Will it be a segment or a feature? Now, I could have this completely wrong and DAS2 is not meant for genome/genome alignments like this. If that's the case please offer an example of how to write an alignment. Second question: What's the format of the CIGAR string? Lincoln's text pointed to http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html That documentation says: > The format starts with the same 9 fields as sugar output (see above), > and is followed by a series of pairs where > operation is one of match, insert or delete, and the length describes > the number of times this operation is repeated. However, it does not list the operation characters nor if there are spaces between the fields. I assume it is "M 51 I 3 M 24 D 3 25 I", though perhaps without spaces. The GFF3 documentation at http://song.sourceforge.net/gff3.shtml refers to http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl but I can find no relevant documentation there. I then found a comment by Richard Durbin from two years ago, at http://portal.open-bio.org/pipermail/bioperl-l/2003-February/ 011234.html > 3) I'm not convinced by the format for the Align string. This requires > a character per aligned base. There are a variety of run-length type > encodings in common use that are much more compact. e.g. Ensembl uses > a > string such as "60M1D8M3I15M" to mean "60 match, then 1 delete, then 8 > match, then 3 insert, then 15 match". They call this CIGAR, but when I > talked to Guy Slater, who invented CIGAR for exonerate, his version is > subtly different: "M 60 D 1 M 8 I 3 M 15" for the same string (see > http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html). > Jim Kent also has something like this. I'd prefer us to standardise on > one of these formats, all of which are very short for ungapped matches. 
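For concreteness, here is a rough (untested) sketch of how the two spellings quoted above reduce to the same list of (operation, length) pairs. The M/I/D letters and their meanings are taken from the quote; everything else, including reading an omitted count as 1, is an assumption for illustration only:

    import re

    def parse_ensembl_cigar(cigar):
        # "60M1D8M3I15M" -> [('M', 60), ('D', 1), ('M', 8), ('I', 3), ('M', 15)]
        # An omitted count is read as 1 (an assumption, not taken from any spec text).
        return [(op, int(count) if count else 1)
                for count, op in re.findall(r"(\d*)([MID])", cigar)]

    def parse_exonerate_cigar(cigar):
        # "M 60 D 1 M 8 I 3 M 15" -> the same list as above
        fields = cigar.split()
        return [(fields[i], int(fields[i + 1])) for i in range(0, len(fields), 2)]

    # Both spellings from the quote describe the same alignment:
    assert parse_ensembl_cigar("60M1D8M3I15M") == parse_exonerate_cigar("M 60 D 1 M 8 I 3 M 15")

Either way the information content is identical; the difference is only which run-length spelling a server chooses to emit.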
Which is the CIGAR string format DAS2 supports? Where is the documentation for it? Andrew dalke at dalkescientific.com From aloraine at gmail.com Fri Apr 14 20:05:17 2006 From: aloraine at gmail.com (Ann Loraine) Date: Fri, 14 Apr 2006 19:05:17 -0500 Subject: [DAS2] question regarding most up-to-date D. melanogaster DAS? Message-ID: <83722dde0604141705t369cd016u30f1ca2ea7622d6c@mail.gmail.com> Hi, I'm helping a colleague with an eQTL study and need to do a region-based query on the most up-to-date fruit fly annotations. Our markers (for influential loci in the study) are mapped to cytological bands. Is it possible to run region-based queries using cytological coordinates? (e.g., 30A - 30B, inclusive) My goal is to find all candidate genes under those peaks. I also have (approximate) mappings of cytological bands onto the physical (genomic coordinates) map of Drosophila, so, if necessary, I could use those to collect the genes mapping to those locations. Which fruit fly DAS server would provide the most up-to-date information? If you have other recommendations for how to proceed, I would be grateful for your help! All the best, Ann -- Ann Loraine Assistant Professor Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From dalke at dalkescientific.com Mon Apr 17 02:54:30 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Apr 2006 00:54:30 -0600 Subject: [DAS2] updated spec Message-ID: Spec writing is like working on a dissertation. Here's an example, in the form of a text adventure http://acephalous.typepad.com/acephalous/2006/04/disadventure.html > look laptop There seems to be a dissertation chapter on the laptop. > read chapter It is long-winded and boring. You do not want to read it. > read chapter It is obnoxious. You hate it. > read book Read. There is a book underneath it that concerns a related topic. > read book Read. There is a book underneath it that concerns a related topic. > work on dissertation You spend two hours searching the OED for the usage history of the word devolve. > work on dissertation You spend three hours reading five articles which have nothing to do with the dissertation. > work on dissertation You spend twenty minutes online reading about baseball. ... > work on dissertation You spend five minutes playing online poker. > work on dissertation You pick your nose. > work on dissertation You go to the kitchen and eat cheese. > work on dissertation The Mets are on. It should be a good game. Anyway, I've gone through the das/das2/draft3/spec.txt document and updated everything (well, not writeback. I'm going to need more cheese.) Next is to get feedback, validate my inline examples, and convert the behemoth into HTML, to replace what's on the web site. Finally. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Apr 17 03:31:13 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Apr 2006 01:31:13 -0600 Subject: [DAS2] outstanding questions Message-ID: These are culled from the current draft of the spec. I used "XXX" to denote regions where I had questions. 1) type ontology URI The TYPE elements have an 'ontology' attribute. This is supposed to be a required element, which is the URI of the corresponding ontology term. At present there is no URI system for ontology. We added a special 'accession' attribute which is the GO id, as in so_accession="SO:0000704" This was meant to be a hack for the hackathon. 
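As a purely illustrative aside (a hypothetical helper, not anything from the spec), a client could normalize whatever it finds in so_accession so that the "SO:" prefix is always present, assuming the numeric part is seven digits as in the example above:

    def normalize_so_accession(value):
        # Accept "SO:0000704" or a bare "0000704" and return "SO:0000704".
        # Hypothetical sketch; the seven-digit width is assumed from the example.
        value = value.strip()
        if value.upper().startswith("SO:"):
            value = value[3:]
        if not (value.isdigit() and len(value) == 7):
            raise ValueError("does not look like an SO accession: %r" % value)
        return "SO:" + value

    assert normalize_so_accession("SO:0000704") == "SO:0000704"
    assert normalize_so_accession("0000704") == "SO:0000704"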
My thought is:
- keep the GO accession (as an optional attribute)
- make 'ontology' be an optional attribute, but one of 'ontology' or 'so_accession' is required

Also, should that be "SO:0000704" or simply "0000704"? I think the "SO:" should be present.

2) Feature strand. I want to make sure this is correct
  1 for positive
  -1 for negative
  0 for unknown
  not given for both strands or does not have meaning

3) taxid

The 'taxid' in the SOURCE element does not appear to be useful. It's written

Notice how the taxid exists in the SOURCE element and the COORDINATES element (and how there are different taxids for each COORDINATES)? I think we can drop 'taxid' from the SOURCE element and if it's important someone should have a COORDINATES element.

4) 'writeable'

The versioned source element contains the attribute "writeable", as in

Do we need that 'writeable' attribute? It seems that if there's a writeback capability then the versioned source is writeable.

5) content-type for FASTA records

"text/plain", "text/x-fasta" or "chem/x-fasta" Looking around now I also see "application/x-fasta" and "application/fasta". I'm going to say "should be text/x-fasta but may be text/plain". Objections?

6) response document too large

I've described that a server may return an error if the response document is too large. This means a client may try again, hopefully making a request which returns a smaller document. My question is, how does a client make a smaller request? What if the server decides that sending more than 5 features at a time is too much? When does the client just give up and say the server implementation is crazy?

7) styles

Are we going to go with the current style system or some other approach? The DAS1 styles had support for limited semantic zooming, with options for "high", "medium" and "low" resolution. What do those mean? When should a client choose one over another? What does "height" mean for a glyph? How do the glyph and text interoperate? Eg, is the "height" the height for both, or just for the glyph? Should style information be moved outside of the DAS2 exchange spec?

8) the "count" format

We talked about, and people wanted, a "count" format. This returns the number of features which would be returned in a query. Does it really return the number of features, or does it return the number of complex annotations (eg, if there is a complex annotation with a root and two children, is that a count of "1" or a count of "3"? Given the way we've done things, I'm going with "3".)

9) alignments

How do I write an alignment? Please give an example - I can't figure it out.

10) CIGAR string

What's the format of the CIGAR string? I've found two main variations. They are
  M 40 I 1 M 12 D 4
  40M1I12M4D
The latter appears to be the most common. However, I did see one case where if no count is given "1" is implied, so the latter can also be written 40MI12M4D

10) Do we need a REGION element?

I've written

All feature locations are given in coordinates on a segment. Some features may be locatable on other features. For example, a contig feature may be locatable on a supercontig. This relationship is stored using a REGION element. A FEATURE element has zero or more REGION elements. The 'feature' attribute of the REGION element contains the URI of the parent feature, on which the current feature is located. A REGION record has an optional 'range' attribute. If not given the feature is on the entire parent feature. The range string is the same syntax and meaning as in the LOC record.
XXX I think this is overkill - what are some good examples of use; perhaps when the global coordinates are not well-defined?. Are negative coordiantes important, like "promoter region is 20 bases upstream from some gene"? Does this need a CIGAR string too? XXX For example, suppose feature A is 6 bases long and is on chromosome 5 at position 10000, on exon X at position 300 and on contig K at position 7. The FEATURE record for this feature may be as follows: 11) XID Currently the XID element has a single attribute, 'href'. I wrote A FEATURE has zero or more XID elements linking the feature record to an external database entry. XXX This is not well-thought out. I think it should have: 'uri' -- a URL or LSID 'authority' -- the name of the database (controlled vocabulary) 'type' -- 'primary', 'accession', or possibly others? 'id' -- the actual identifier 'description' -- a paragraph or so describing the link, for humans to see why they might want to look into a link This has to be a well-defined concept. Let's steal from someone else. The use-case here is to link to sequence records in other databases and to link to PubMed or other bibliographic databases. 12) complex features In the spec I wrote Some features are complex and cannot easily be modeled with a single feature record. Quoting from the "Chado Schema Documentation" XXX give hyperlink XXX The class of transplicing events that involve ligating transcripts from different loci into a mature mRNA requires a separate feature to represent each locus transcript and one to represent the fused transcript. The fragments are located on the fused transcript; portions of the fused transcript can also be located on the genome. Is this a relevant example of a complex feature for DAS2? If not, give another example. In general I'm having a hard time coming up with good examples of various forms of complex features. I just don't know the domain well enough. 13) "root" attribute I proposed that features have a new, optional attribute called "root". If a feature is part of a complex annotation then the "root" attribute must be present and it must have the URI of the root feature for the annotation. This makes client processing easier, though it is not needed in the purest of senses. 14) features have a 'STYLE' element The idea was that an individual feature could override the style given in the feature type record. I don't think that's useful and/or we need a real stylesheet instead. I'm going to drop the STYLE element from the FEATURE element unless there is objection. 15) In text searches we've defined ABC -- field exactly matches "ABC" *ABC -- field ends with "ABC" ABC* -- field starts with "ABC" *ABC* -- field contains the substring "ABC" I want to say that using "*" and "?" elsewhere in the query string is implementation dependent. That is, "A*B" might match everything with an A followed by a B or it might match the exact string "A*B" and only that string. I did this because looking around at various tools it looks like it might be hard to change the meaning of "*" and "?" for the text searches. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Apr 17 03:40:07 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Apr 2006 01:40:07 -0600 Subject: [DAS2] proposed April 17 agenda Message-ID: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> Gregg is taking the month off. I volunteered to be in charge of the next teleconference. Here is what I would like to talk about: 1. get additional agenda items 2. 
status reports 3. who maintains the list of reference names for different genomes (starting with the list Licoln developed)? 4. resolve some questions with the spec (see my previous email) 5. get a volunteer to come up with best-practices examples of how to represent various complex annotations 6. writeback planning Andrew dalke at dalkescientific.com From lstein at cshl.edu Mon Apr 17 09:46:23 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 17 Apr 2006 09:46:23 -0400 Subject: [DAS2] alignments In-Reply-To: <5dd5ce9d6d6e977e56c7b4e30e622f7c@dalkescientific.com> References: <5dd5ce9d6d6e977e56c7b4e30e622f7c@dalkescientific.com> Message-ID: <200604170946.24479.lstein@cshl.edu> I didn't realize there were multiple things called CIGAR. I think we should use Ensembl CIGAR format. The target of the alignment should be a segment, and not another feature. Best, Lincoln On Friday 14 April 2006 04:29, Andrew Dalke wrote: > I need a bit of help here. I'm trying to hand-write an example of a > feature based on an alignment. Let's assume these are annotations on > fly and it's aligned to human. There's a hit from > > fly chromosome 4 > http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4 > range 100:200 > > to human chromosome 8 > http://www.ensembl.org/Homo_sapiens/Chr1 > range 200:300 > > Assume the CIGAR string of the match is > 51 identical, 3 insertions, 24 identical, 3 deletions, 25 identical > > Here's the best I can manage: > > > > segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4" > range="100:200" cigar="?????"/> > > > > > First question: > Where do I put the object to which the alignment aligns? Will > it be a segment or a feature? Now, I could have this completely wrong > and DAS2 is not meant for genome/genome alignments like this. If > that's the case please offer an example of how to write an alignment. > > > Second question: > What's the format of the CIGAR string? Lincoln's text pointed to > http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html > > That documentation says: > > The format starts with the same 9 fields as sugar output (see above), > > and is followed by a series of pairs where > > operation is one of match, insert or delete, and the length describes > > the number of times this operation is repeated. > > However, it does not list the operation characters nor if there are > spaces > between the fields. I assume it is "M 51 I 3 M 24 D 3 25 I", though > perhaps > without spaces. > > The GFF3 documentation at http://song.sourceforge.net/gff3.shtml refers > to > http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl > but I can find no relevant documentation there. > > I then found a comment by Richard Durbin from two years ago, at > > http://portal.open-bio.org/pipermail/bioperl-l/2003-February/ > 011234.html > > > 3) I'm not convinced by the format for the Align string. This requires > > a character per aligned base. There are a variety of run-length type > > encodings in common use that are much more compact. e.g. Ensembl uses > > a > > string such as "60M1D8M3I15M" to mean "60 match, then 1 delete, then 8 > > match, then 3 insert, then 15 match". They call this CIGAR, but when I > > talked to Guy Slater, who invented CIGAR for exonerate, his version is > > subtly different: "M 60 D 1 M 8 I 3 M 15" for the same string (see > > http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html). > > Jim Kent also has something like this. 
I'd prefer us to standardise on > > one of these formats, all of which are very short for ungapped matches. > > Which is the CIGAR string format DAS2 supports? Where is the > documentation for it? > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From dalke at dalkescientific.com Mon Apr 17 12:19:47 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Apr 2006 10:19:47 -0600 Subject: [DAS2] question regarding most up-to-date D. melanogaster DAS? In-Reply-To: <83722dde0604141705t369cd016u30f1ca2ea7622d6c@mail.gmail.com> References: <83722dde0604141705t369cd016u30f1ca2ea7622d6c@mail.gmail.com> Message-ID: <3ecdfacc003d58cc93045bc7a4aefb57@dalkescientific.com> Ann: > Our markers (for influential loci in the study) are mapped to > cytological bands. Is it possible to run region-based queries using > cytological coordinates? (e.g., 30A - 30B, inclusive) My goal is to > find all candidate genes under those peaks. At present there is no way to do that. A server can extend the query syntax to support searches in cytological coordinates and add new feature elements to store those coordinates. I don't know enough about how people use those coordinates to sketch an example. Andrew dalke at dalkescientific.com From aloraine at gmail.com Mon Apr 17 13:47:03 2006 From: aloraine at gmail.com (Ann Loraine) Date: Mon, 17 Apr 2006 12:47:03 -0500 Subject: [DAS2] question regarding most up-to-date D. melanogaster DAS? In-Reply-To: <3ecdfacc003d58cc93045bc7a4aefb57@dalkescientific.com> References: <83722dde0604141705t369cd016u30f1ca2ea7622d6c@mail.gmail.com> <3ecdfacc003d58cc93045bc7a4aefb57@dalkescientific.com> Message-ID: <83722dde0604171047r26a32986gaa4c3b34b6166c16@mail.gmail.com> I'm not sure it would be worth adding more work to the project to allow for these cases. If funding is renewed, then I think it would be worth the effort. But for now, probably not, since it would be a new feature. (At this stage, avoiding feature creep seems advisable :-) I believe I can get a mapping of cytological bands onto genomic coordinates from FlyBase. I don't know how reliable these mappings are, but assuming they are okay, I can use them to query a fly DAS site to get the genes in those coordinates. I'm not sure what is the best DAS site to use for this, however. -Ann On 4/17/06, Andrew Dalke wrote: > Ann: > > Our markers (for influential loci in the study) are mapped to > > cytological bands. Is it possible to run region-based queries using > > cytological coordinates? (e.g., 30A - 30B, inclusive) My goal is to > > find all candidate genes under those peaks. > > At present there is no way to do that. > > A server can extend the query syntax to support searches in > cytological coordinates and add new feature elements to store > those coordinates. I don't know enough about how people use > those coordinates to sketch an example. 
> > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 > -- Ann Loraine Assistant Professor Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org

From dalke at dalkescientific.com Tue Apr 18 03:36:39 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 18 Apr 2006 01:36:39 -0600 Subject: [DAS2] proposed April 17 agenda In-Reply-To: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> References: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> Message-ID:

Summary of today's conference call.

> 2. status reports

The biggest one is that the new version of IGB is out and the Affy DAS server is available at http://netaffxdas.affymetrix.com/das2/sequence Steve and Ed (as I recall) tracked down a problem with that server which might affect other implementations. The problem is knowing the public/external URL for the DAS service. In theory it can be determined by looking at various CGI headers, but with things like an Apache rewrite and forwards to the actual server it can get complicated. The solution seems to be either to use relative links or to have a configuration option in the server specifying the base name. Lincoln's been working on reference names. Allen's been working on how the writeback server might work. I've been working on the spec, and have not gone further with the validator.

> 3. who maintains the list of reference names for different > genomes (starting with the list Lincoln developed)?

Lincoln proposed, to broad acceptance, that we set up a wiki page with the reference names. The easiest way is to use the OBF wiki, at http://open-bio.org/wiki/Main_Page because that is already set up. I can ask the OBF about the appropriateness of that - I think it's fine.

> 4. resolve some questions with the spec (see my previous email)

Here are the resolutions:

1) type ontology URI

I've emailed Suzi asking about plans by GO, the Gene Ontology Consortium, or whoever is involved, for coming up with standardized, public ontology URLs. Allen's cc'ed on it, and we'll discuss this off the DAS list.

2) Feature strand. I stand corrected. The definitions are
  1 for positive
  -1 for negative
  0 for both strands
  not given for don't know or does not have meaning

3) taxid

There seems to be no reason to keep the 'taxid' in the SOURCE element. We'll only have it in the COORDINATES element.

4) 'writeable'

We'll defer this (leaving it as-is) until we have the writeback defined a bit better.

5) content-type for FASTA records

We'll recommend "text/x-fasta" or "text/plain" as the content-type for FASTA responses. There is no widely accepted community standard.

6) response document too large

There is no automatic way for a client to narrow its request. This must be done by a person, depending on what the search criteria are. Servers should support large requests so that this isn't a problem.

7) styles

We'll shift to using a stylesheet. This will be listed in the versioned source record as

As a rough sketch the document will look like

The STYLE elements add a new "uri" attribute which is the URI of the feature type being styled. In theory this could also include the feature uri (to define the style for a single feature) or an ontology uri (sets the style for all features with that ontology term or its descendants). However, with that comes problems of precedence.
If the feature type and the feature and the ontology each have styles, which one wins? I think feature beats type beats ontology. But I also think we can ignore this because no one has asked for this sort of flexibility. (More flexibility would be support for a query language selecting which features, types, sources, ontologies, feature alias, etc. should get a given style. Not going there. :) 8) the "count" format This should be the number of feature elements returned, and not the number of "annotations" (counting the multiple features of a complex annotation as 1) 9) alignments Lincoln will provide examples. 10) CIGAR string We'll use the EBI style CIGAR strings, and the documentation will be based on the GFF3 description at http://song.sourceforge.net/gff3.shtml 10.5) Do we need a REGION element? No. Deleted from the spec. 11) XID On Ed's recommendation I'm looking at MAGE XML. I am not a good UML reader so it's slow going. My view so far is that what I sketched out is on the right track and we can simplify things compared to MAGE, eg, we don't need full bibliographic records. The other idea is to defer finalizing this until people start providing data with XIDs, so we know what's needed. 12) complex features Lincoln will come up with some examples. 13) "root" attribute There are two changes here: - complex annotations must have a single root feature - all features which are in complex annotations must have a link to the root element There's some worry about the first requirement, in that some complex annotations may not have a "real" root. I argue that having a synthetic one is okay. There were no strong arguments against having a single root. We decided to defer finalizing this until we have some example of complex annotations. 14) features have a 'STYLE' element no, they don't. 15) "*" and "?" in the query string The proposal here is to say that the interpretation of "*" other than at the start and/or end of the query string is implementation defined, as is the use of "?". It used to be that any other use of "*" must be treated as an asterisks, so "***" finds all strings containing a "*". It looks like people are fine with this looseness. > 5. get a volunteer to come up with best-practices examples > of how to represent various complex annotations That's Lincoln. > 6. writeback planning Allen will take the implementation lead on this, funding willing. He's currently working on how to associate an identifier with a new feature. One thought is to progress in stages: - upload completely new features / complex annotations to the server - modify an existing feature, though not the parent/part relationship (eg, change the location) - delete a simple feature - delete a complex annotation - modify an existing complex annotation, or turn a simple feature into a complex annotation - do 'em all at once The work will need to be server driven as the current clients can't handle this before the end of the funding period. The clients will mostly be library code. Andrew dalke at dalkescientific.com From lstein at cshl.edu Mon Apr 24 08:35:21 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 24 Apr 2006 08:35:21 -0400 Subject: [DAS2] Not able to make it today In-Reply-To: References: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> Message-ID: <200604240835.21690.lstein@cshl.edu> Hi All, Due to wedding preparations I will be unable to attend the conference call today. I might or might not be able to make it next week (I'll be in Toronto) but I'll let you know in advance. 
Best, Lincoln -- Lincoln Stein lstein at cshl.edu Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From dalke at dalkescientific.com Mon Apr 24 12:11:31 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 24 Apr 2006 10:11:31 -0600 Subject: [DAS2] April 24 meeting - cancel? In-Reply-To: <200604240835.21690.lstein@cshl.edu> References: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> <200604240835.21690.lstein@cshl.edu> Message-ID: Hi all, I'm trying to come up with an agenda but I've done very little the last week DAS related. I've been working on selling my house. Looks like this will be a short meeting, or should we just cancel? Here's my status. - Sent mail to Suzi asking about URIs for ontologies. Heard nothing from her yet. - Talked with the OBF people about setting up a wiki for the reference names for the genomes/segments. We decided to use the OBF wiki for now and if there are enough pages we'll migrate over to a biodas-specific wiki. I'm about 1/2-way through, learning wiki syntax. I'll email when it's there. - I've migrated the spec 300 doc into CVS. Just checked it in. There's still some formatting issues though. - started working on the stylesheet spec. Should take another 3 hours or so. - haven't been able to log into cgi.biodas.org to restart the validation server. - still need to write an rnc for the writeback for Allen Andrew dalke at dalkescientific.com From allenday at ucla.edu Mon Apr 24 12:29:09 2006 From: allenday at ucla.edu (Allen Day) Date: Mon, 24 Apr 2006 09:29:09 -0700 Subject: [DAS2] April 24 meeting - cancel? In-Reply-To: References: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> <200604240835.21690.lstein@cshl.edu> Message-ID: <5c24dcc30604240929l7a882dd9qa15c0a51bd636cb0@mail.gmail.com> Let's cancel it. I have a database set up for writeback, and am able to POST delta XML to the server. I am still at the stage where I am parsing the XML. The DTD would be helpful. See attached figure "writeback.png" for the current implementation track. I am at the "Parse XML" step in implementation. See attached "vsourcecommand.png" for an overview of the previous writeback plans as documented in the HTML docs, and "vsourcelock.png" for an overview of lock plans as documented in the HTML docs. Parts of these may at some point be helpful for folding into the current implementation. I can send or commit to CVS the source documents for any of these diagrams if people would like to edit. -Allen On 4/24/06, Andrew Dalke wrote: > > Hi all, > > I'm trying to come up with an agenda but I've done very little > the last week DAS related. I've been working on selling my house. > Looks like this will be a short meeting, or should we just cancel? > > Here's my status. > > - Sent mail to Suzi asking about URIs for ontologies. Heard > nothing from her yet. > > - Talked with the OBF people about setting up a wiki for the > reference names for the genomes/segments. We decided to use the > OBF wiki for now and if there are enough pages we'll migrate over > to a biodas-specific wiki. I'm about 1/2-way through, learning > wiki syntax. I'll email when it's there. > > - I've migrated the spec 300 doc into CVS. Just checked it > in. There's still some formatting issues though. > > - started working on the stylesheet spec. 
Should take another > 3 hours or so. > > - haven't been able to log into cgi.biodas.org to restart the > validation server. > > - still need to write an rnc for the writeback for Allen > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 >

-------------- next part -------------- A non-text attachment was scrubbed... Name: writeback.png Type: image/png Size: 41093 bytes Desc: not available URL:

-------------- next part -------------- A non-text attachment was scrubbed... Name: vsourcelock.png Type: image/png Size: 91466 bytes Desc: not available URL:

-------------- next part -------------- A non-text attachment was scrubbed... Name: vsourcecommand.png Type: image/png Size: 49552 bytes Desc: not available URL:

From dalke at dalkescientific.com Mon Apr 24 13:39:29 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 24 Apr 2006 11:39:29 -0600 Subject: [DAS2] sequence names on wiki Message-ID: <6e4986bba9736f1c43f239646b8a22d4@dalkescientific.com>

I've imported Lincoln's list of global sequence identifiers onto the open-bio wiki at http://open-bio.org/wiki/DAS:GlobalSeqIDs Andrew dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Apr 27 03:33:29 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 27 Apr 2006 01:33:29 -0600 Subject: [DAS2] writeback spec Message-ID:

I've written up a draft of the writeback spec. It's in CVS.
  das/das2/das2_writeback.html
with the RNC in
  das/das2/writeback.rnc -- for the writeback document
  das/das2/mapping.rnc -- for the mapping from old URLs to new

On the question of how to handle new records, which need new identifiers, I decided to go with the private identifier scheme. The client uses "das-private:0000" where the "0000" is alphanumeric and 1 up to 20 characters long. The server responds with a mapping document which looks like

I decided on this instead of the "preallocate identifier" scheme because this requires less state on the server (it doesn't need to remember which identifiers were already issued) and because it supports versioning servers better.

Is the web site being updated from CVS? I see it hasn't gotten the updates I made on Monday. Andrew dalke at dalkescientific.com

From Steve_Chervitz at affymetrix.com Thu Apr 27 13:34:12 2006 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Thu, 27 Apr 2006 10:34:12 -0700 Subject: [DAS2] writeback spec In-Reply-To: Message-ID:

Andrew, > From: Andrew Dalke > Date: Thu, 27 Apr 2006 01:33:29 -0600 > To: DAS/2 > Subject: [DAS2] writeback spec > > I've written up a draft of the writeback spec. It's in CVS. Great. Thanks. > > Is the web site being updated from CVS? I see it hasn't gotten > the updates I made on Monday.

You mean in some automated fashion? Before we switched to generating the html from templates, I set up a cron that updated the manually edited html file for the read spec on biodas.org. I don't know if there is an automated process that produces the template-based html from CVS on biodas.org -- unless you or Lincoln set something up.

BTW, I can't ssh into portal.open-bio.org, or even ping it. This is (or perhaps was) the machine hosting biodas.org. Do you know the story here?
Steve

From dalke at dalkescientific.com Thu Apr 27 13:55:55 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 27 Apr 2006 11:55:55 -0600 Subject: [DAS2] writeback spec In-Reply-To: References: Message-ID:

Steve: > You mean in some automated fashion? Before we switched to generating > the > html from templates, I set up a cron that updated the manually edited > html > file for the read spec on biodas.org. I don't know if there is an > automated > process that produces the template-based html from CVS on biodas.org -- > unless you or Lincoln set something up.

I didn't set anything up. One thing to note though is that I'm not using the template system for the current specs. The validator I have now is much more powerful than the one then so I'm parsing the spec documents and validating them. "More powerful" includes that I can report the error line as it is in the spec document and not just in the piece of XML to validate. It should be possible to just pull the specs out of CVS.

> BTW, I can't ssh into portal.open-bio.org, or even ping it. This is (or > perhaps was) the machine hosting biodas.org. Do you know the story here?

Chris Dag. sent out an email on 3/23 "Important news for all developers with open-bio.org CVS access

(2) All of our websites have been consolidated on the new server newportal.open-bio.org

Andrew dalke at dalkescientific.com

From Steve_Chervitz at affymetrix.com Thu Apr 27 14:09:09 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Thu, 27 Apr 2006 11:09:09 -0700 Subject: [DAS2] writeback spec In-Reply-To: Message-ID:

Andrew: > I didn't set anything up. One thing to note though is that I'm not > using > the template system for the current specs. The validator I have now is > much more powerful than the one then so I'm parsing the spec documents > and validating them. "More powerful" includes that I can report the > error line as it is in the spec document and not just in the piece > of XML to validate. > > It should be possible to just pull the specs out of CVS.

Cool. I can look into updating my cronjob to grab the new specs.

> Steve: >> BTW, I can't ssh into portal.open-bio.org, or even ping it. This is (or >> perhaps was) the machine hosting biodas.org. Do you know the story here? > > Chris Dag. sent out an email on 3/23 "Important news for all developers > with open-bio.org CVS access > > (2) All of our websites have been consolidated on the new server > newportal.open-bio.org

Yep. Just realized that. At the moment, I can't access my account on this new server. Probably my password got reset. I've got a support request in.

Steve

From Steve_Chervitz at affymetrix.com Thu Apr 27 15:16:23 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Thu, 27 Apr 2006 12:16:23 -0700 Subject: [DAS2] writeback spec In-Reply-To: Message-ID:

OK, Andrew's writeback spec is now accessible at: http://www.biodas.org/documents/das2/das2_writeback.html Be sure to refresh your browsers to get the latest spec at http://biodas.org/documents/das2/das2_protocol.html

I re-established my cronjob to update all the documents in this das2 directory twice daily (00:01 and 12:01 East coast time). This das2 directory is a new cvs checkout. I moved the previous das2 directory to das2.old, in case it contains anything we might need that isn't in CVS (accessible via http://www.biodas.org/documents/das2.old/ ).

Steve

> From: Andrew Dalke > Date: Thu, 27 Apr 2006 01:33:29 -0600 > To: DAS/2 > Subject: [DAS2] writeback spec > > I've written up a draft of the writeback spec.
It's in CVS. > > das/das2/das2_writeback.html > with the RNC in > das/das2/writeback.rnc -- for the writeback document > das/das2/mapping.rnc -- for the mapping from old URLs to new > > On the question of how to handle new records, which need > new identifiers, I decided to go with the private identifier > scheme. The client uses "das-private:0000" where the "0000" > is alphanumeric and 1 up to 20 characters long. The server > responds with a mapping document which looks like > > > to="http://blah.com/das2/whatever/feature/123" /> > > > I decided on this instead of the "preallocate identifier" > scheme because this requires less state on the server > (it doesn't need to remember which identifiers were already > issued) and because it supports versioning servers better. > > > Is the web site being updated from CVS? I see it hasn't gotten > the updates I made on Monday. > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2

From dalke at dalkescientific.com Fri Apr 28 13:04:30 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 28 Apr 2006 11:04:30 -0600 Subject: [DAS2] splits and joins in writeback, an alternative Message-ID: <1e3fb6fceaa9c77cc25511181c35e45b@dalkescientific.com>

Roy, in private email, pointed out that my writeback spec doesn't include ways to track splits and joins. Here's my response to that topic. I sent it to him last night but resend it here now because I hope to talk about it on Monday.

------

The use model we have is a curator works on a section of the genome for a while (a few hours to perhaps a day). Once done all of the changes are sent back to the server. The writeback document in the current draft looks like ... ... ... ...

The message at this point would be "I did a lot of work in the last few hours." It's not very useful. Thinking of it as code, it's like working for a day on code without checking things into version control, so you end up with commit messages with a dozen items in them and it's hard to see which code changes correspond to which item.

What if the writeback delta looked like ... ... ... ... ... ... ... ... ...

The MESSAGE is set by the person, the REASON is set by the software, perhaps with details using a controlled vocabulary ("split", "merge", "creation", ...) It feels to me like this gives essentially the same information as explicitly listing how A comes from {X0, X1, X...} features. Perhaps not exactly the same detail, but close enough for what people want. On the plus side it can handle complicated changes, like if 3 features (ranges 100-300, 310-600, 620-800) are converted into 2 (ranges 100-500 and 510-800) merged three elements into two ... ...

Andrew dalke at dalkescientific.com

From dalke at dalkescientific.com Sun Apr 30 22:37:29 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sun, 30 Apr 2006 20:37:29 -0600 Subject: [DAS2] May 1 is a UK holiday Message-ID: <9c6cb86e3269238eb234bb4b2c6da293@dalkescientific.com>

Andreas wrote to me in private email saying > here in england 1st of may is a public holiday... The hope was to talk about writeback, but the UK people (and most specifically Roy) won't be able to make it. Does anyone have any feedback on the writeback spec or comments on my solution to splits and joins?
Andrew dalke at dalkescientific.com
I have cc'd him on this email as I think he >> will have a >> lot to say about the jdbc adapter. Cyril has uncovered many bugs >> and has fixed a >> lot of them (thank you cyril) as hes a very savvy java programmer. >> And he has >> also forced the adapter to generalize and brought about the >> evolution of the >> config file to adapt to chado differences. But as Cyril can attest >> (Cyril feel >> free to elaborate) it has been a lot of work to get jdbc working >> for him. There >> were a lot of bugs to fix that we both went after. Hopefully now >> its a bit more >> stable and the next db/mod wont have as many problems. I think >> Cyril is still at >> the test phase and hasn't gone into production (Cyril?) >> >> Berkeley is using the jdbc adapter for an in house project. They >> are using the >> jdbc reader to load up game files (as the straight jdbc reader is >> slow as the >> chado db is rather slow) which are then loaded by a curator. They >> are saving >> game, and then I think chris mungall is xslting game to chado xml >> which is then >> saved with xort - or he is somehow writing game in another way - >> not actually >> sure. The Berkeley group drove the need for 1 level annotations(in >> jdbc,game,& >> apollo datmodel) >> >> Jonathan Crabtree at TIGR wrote the jdbc read adapter, and they >> use it there. I >> believe they are intending to use the write adapter but dont yet >> do so (Jonathan?). >> >> I should mention that reading jdbc straight from chado tends to be >> slow, as I >> find that chado is a slow database, at least for Berkeley. It >> really depends on >> the db vendor and the amount of data. TIGRs reading is actually >> really zippy. >> The workaround for slow chados is to dump game files that read in >> pretty fast. >> >> In all fairness, you should probably email with FlyBase (& Chris >> Mungall) and >> get the pros of using chado xml & xort, which they can give a far >> better answer >> on than I. >> >> Hope this helps, >> Mark > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From lstein at cshl.edu Thu Apr 6 20:08:30 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Thu, 6 Apr 2006 16:08:30 -0400 Subject: [DAS2] Global IDs for worm Message-ID: <200604061608.32914.lstein@cshl.edu> I've created a directory in the das CVS under das2/GlobalSeqIDs/ to hold text files describing sequence IDs for common organisms. Currently I've created one for Worm. My schedule for the others is: Drosophilids Yeast Human Mouse Drosophila is the difficult one because there are many partial sequences. I may just do melanogaster for now. Lincoln -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From dalke at dalkescientific.com Mon Apr 10 04:24:24 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sun, 9 Apr 2006 22:24:24 -0600 Subject: [DAS2] was ill Message-ID: <0436e7cb5802c65cbce1a757a2a31b2f@dalkescientific.com> Hi all, The reason you haven't heard from me in the last week is I was quite ill with an upper respiratory virus, which you heard a bit of in last week's phone conference. I was barely able to read a paragraph at a time, much less write anything coherent. It broke yesterday afternoon and I'm able to work now. 
Strangest part was on Friday night when I dreamed about parsing RSS feeds and every time I tried to get element [0] I would wake up coughing. That's some virus! Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Apr 10 17:19:23 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 10 Apr 2006 10:19:23 -0700 Subject: [DAS2] Problem with DAS/2 registry? Message-ID: I've been trying to reach the DAS/2 registry at: http://www.spice-3d.org/dasregistry/das2/sources which used to work, but now I'm getting this error message: Proxy Error The proxy server received an invalid response from an upstream server. The proxy server could not handle the request GET?/dasregistry/das2/sources. Reason: Could not connect to remote machine: Connection refused Apache/1.3.33 Server at www.spice-3d.org Port 80 Any idea what the problem is? Thanks, Gregg From dalke at dalkescientific.com Fri Apr 14 08:29:46 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 14 Apr 2006 02:29:46 -0600 Subject: [DAS2] alignments Message-ID: <5dd5ce9d6d6e977e56c7b4e30e622f7c@dalkescientific.com> I need a bit of help here. I'm trying to hand-write an example of a feature based on an alignment. Let's assume these are annotations on fly and it's aligned to human. There's a hit from fly chromosome 4 http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4 range 100:200 to human chromosome 8 http://www.ensembl.org/Homo_sapiens/Chr1 range 200:300 Assume the CIGAR string of the match is 51 identical, 3 insertions, 24 identical, 3 deletions, 25 identical Here's the best I can manage: First question: Where do I put the object to which the alignment aligns? Will it be a segment or a feature? Now, I could have this completely wrong and DAS2 is not meant for genome/genome alignments like this. If that's the case please offer an example of how to write an alignment. Second question: What's the format of the CIGAR string? Lincoln's text pointed to http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html That documentation says: > The format starts with the same 9 fields as sugar output (see above), > and is followed by a series of pairs where > operation is one of match, insert or delete, and the length describes > the number of times this operation is repeated. However, it does not list the operation characters nor if there are spaces between the fields. I assume it is "M 51 I 3 M 24 D 3 25 I", though perhaps without spaces. The GFF3 documentation at http://song.sourceforge.net/gff3.shtml refers to http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl but I can find no relevant documentation there. I then found a comment by Richard Durbin from two years ago, at http://portal.open-bio.org/pipermail/bioperl-l/2003-February/ 011234.html > 3) I'm not convinced by the format for the Align string. This requires > a character per aligned base. There are a variety of run-length type > encodings in common use that are much more compact. e.g. Ensembl uses > a > string such as "60M1D8M3I15M" to mean "60 match, then 1 delete, then 8 > match, then 3 insert, then 15 match". They call this CIGAR, but when I > talked to Guy Slater, who invented CIGAR for exonerate, his version is > subtly different: "M 60 D 1 M 8 I 3 M 15" for the same string (see > http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html). > Jim Kent also has something like this. I'd prefer us to standardise on > one of these formats, all of which are very short for ungapped matches. 
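(To make the difference concrete for the alignment described above -- 51 identical, 3 insertions, 24 identical, 3 deletions, 25 identical -- the two styles would read roughly as follows; this is just an illustration derived from those operation counts, not text taken from either specification:

    exonerate-style:  M 51 I 3 M 24 D 3 M 25
    Ensembl-style:    51M3I24M3D25M
)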
Which is the CIGAR string format DAS2 supports? Where is the documentation for it? Andrew dalke at dalkescientific.com From aloraine at gmail.com Sat Apr 15 00:05:17 2006 From: aloraine at gmail.com (Ann Loraine) Date: Fri, 14 Apr 2006 19:05:17 -0500 Subject: [DAS2] question regarding most up-to-date D. melanogaster DAS? Message-ID: <83722dde0604141705t369cd016u30f1ca2ea7622d6c@mail.gmail.com> Hi, I'm helping a colleague with an eQTL study and need to do a region-based query on the most up-to-date fruit fly annotations. Our markers (for influential loci in the study) are mapped to cytological bands. Is it possible to run region-based queries using cytological coordinates? (e.g., 30A - 30B, inclusive) My goal is to find all candidate genes under those peaks. I also have (approximate) mappings of cytological bands onto the physical (genomic coordinates) map of Drosophila, so, if necessary, I could use those to collect the genes mapping to those locations. Which fruit fly DAS server would provide the most up-to-date information? If you have other recommendations for how to proceed, I would be grateful for your help! All the best, Ann -- Ann Loraine Assistant Professor Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From dalke at dalkescientific.com Mon Apr 17 06:54:30 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Apr 2006 00:54:30 -0600 Subject: [DAS2] updated spec Message-ID: Spec writing is like working on a dissertation. Here's an example, in the form of a text adventure http://acephalous.typepad.com/acephalous/2006/04/disadventure.html > look laptop There seems to be a dissertation chapter on the laptop. > read chapter It is long-winded and boring. You do not want to read it. > read chapter It is obnoxious. You hate it. > read book Read. There is a book underneath it that concerns a related topic. > read book Read. There is a book underneath it that concerns a related topic. > work on dissertation You spend two hours searching the OED for the usage history of the word devolve. > work on dissertation You spend three hours reading five articles which have nothing to do with the dissertation. > work on dissertation You spend twenty minutes online reading about baseball. ... > work on dissertation You spend five minutes playing online poker. > work on dissertation You pick your nose. > work on dissertation You go to the kitchen and eat cheese. > work on dissertation The Mets are on. It should be a good game. Anyway, I've gone through the das/das2/draft3/spec.txt document and updated everything (well, not writeback. I'm going to need more cheese.) Next is to get feedback, validate my inline examples, and convert the behemoth into HTML, to replace what's on the web site. Finally. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Apr 17 07:31:13 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Apr 2006 01:31:13 -0600 Subject: [DAS2] outstanding questions Message-ID: These are culled from the current draft of the spec. I used "XXX" to denote regions where I had questions. 1) type ontology URI The TYPE elements have an 'ontology' attribute. This is supposed to be a required element, which is the URI of the corresponding ontology term. At present there is no URI system for ontology. We added a special 'accession' attribute which is the GO id, as in so_accession="SO:0000704" This was meant to be a hack for the hackathon. 
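(For concreteness, an assumed layout -- not taken from the spec -- showing where such attributes would sit on a TYPE record; the ontology URI below is a made-up placeholder precisely because no real URI scheme for ontology terms existed yet:

    <TYPE uri="http://blah.com/das2/whatever/type/gene"
          title="gene"
          so_accession="SO:0000704"
          ontology="http://blah.com/ontology/SO/0000704" />
)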
My thought is: - keep the GO accession (as an optional attribute) - make 'ontology' be an optional attribute, but one of 'ontology' or 'so_accession' is required Also, should that be "SO:0000704" or simply "0000704" ? I think the "SO:" should be present. 2) Feature strand. I want to make sure this is correct 1 for positive -1 for negative 0 for unknown not given for both strands or does not have meaning 3) taxid The 'taxid' in the SOURCE element does not appear to be useful. It's written Notice how the taxid exists in the SOURCE element and the COORDINATES element (and how there are difference taxids for each COORDINATES)? I think we can drop 'taxid' from the SOURCE element and if it's important someone should have a COORDINATES element. 4) 'writeable' The versioned source element contains the attribute "writeable", as in Do we need that 'writeable' attribute? It seems that if there's a writeback capability then then versioned source is writeable. 5) content-type for FASTA records "text/plain", "text/x-fasta" or "chem/x-fasta" Looking around now I also see "application/x-fasta" and "application/fasta". I'm going to say "should be text/x-fasta but may be text/plain". Objections? 6) response document too large I've described that a server may return an error if the response document is too large. This means a client may try again, hopefully making a request which returns a smaller document. My question is, how does a client make a smaller request? What if the server decides that sending more than 5 features at a time is too much? When does the client just give up and say the server implementation is crazy? 7) styles Are we going to go with the current style system or some other approach? The DAS1 styles had support for limited semantic zooming, with options for "high", "medium" and "low" resolution. What do those mean? When should a client choose one over another? What does "height" mean for a glyph? How do the glyph and text interoperate? Eg, is the "height" the height for both, or just for the glyph? Should style information be moved outside of the DAS2 exchange spec? 8) the "count" format We talked about, and people wanted, a "count" format. This returns the number of features which would be returned in a query. Does it really return the number of features, or does it return the number of complex annotations (eg, if there is a complex annotation with a root and two children, is that a count of "1" or a count of "3"? Given the way we've done things, I'm going with "3".) 9) alignments How do I write an alignment? Please give an example - I can't figure it out. 10) CIGAR string What's the format of the CIGAR string? I've found two main variations. They are M 40 I 1 M 12 D 4 40M1I12M4D The latter appears to be the most common. However, I did see one case where if no count is given "1" is implied, so the latter can also be written 40MI12M4D 10) Do we need a REGION element? I've written All feature locations are given in coordinates on a segment. Some features may be locatable on other features. For example, a contig feature may be locatable on a supercontig. This relationship is stored using a REGION element. A FEATURE element has zero or more REGION elements. The 'feature' attribute of the REGION element contains the URI of the parent feature, on which the current feature is located. A REGION record has an optional 'range' attribute. If not given the feature is on the entire parent feature. The range string is the same syntax and meaning as in the LOC record. 
XXX I think this is overkill - what are some good examples of use; perhaps when the global coordinates are not well-defined?. Are negative coordiantes important, like "promoter region is 20 bases upstream from some gene"? Does this need a CIGAR string too? XXX For example, suppose feature A is 6 bases long and is on chromosome 5 at position 10000, on exon X at position 300 and on contig K at position 7. The FEATURE record for this feature may be as follows: 11) XID Currently the XID element has a single attribute, 'href'. I wrote A FEATURE has zero or more XID elements linking the feature record to an external database entry. XXX This is not well-thought out. I think it should have: 'uri' -- a URL or LSID 'authority' -- the name of the database (controlled vocabulary) 'type' -- 'primary', 'accession', or possibly others? 'id' -- the actual identifier 'description' -- a paragraph or so describing the link, for humans to see why they might want to look into a link This has to be a well-defined concept. Let's steal from someone else. The use-case here is to link to sequence records in other databases and to link to PubMed or other bibliographic databases. 12) complex features In the spec I wrote Some features are complex and cannot easily be modeled with a single feature record. Quoting from the "Chado Schema Documentation" XXX give hyperlink XXX The class of transplicing events that involve ligating transcripts from different loci into a mature mRNA requires a separate feature to represent each locus transcript and one to represent the fused transcript. The fragments are located on the fused transcript; portions of the fused transcript can also be located on the genome. Is this a relevant example of a complex feature for DAS2? If not, give another example. In general I'm having a hard time coming up with good examples of various forms of complex features. I just don't know the domain well enough. 13) "root" attribute I proposed that features have a new, optional attribute called "root". If a feature is part of a complex annotation then the "root" attribute must be present and it must have the URI of the root feature for the annotation. This makes client processing easier, though it is not needed in the purest of senses. 14) features have a 'STYLE' element The idea was that an individual feature could override the style given in the feature type record. I don't think that's useful and/or we need a real stylesheet instead. I'm going to drop the STYLE element from the FEATURE element unless there is objection. 15) In text searches we've defined ABC -- field exactly matches "ABC" *ABC -- field ends with "ABC" ABC* -- field starts with "ABC" *ABC* -- field contains the substring "ABC" I want to say that using "*" and "?" elsewhere in the query string is implementation dependent. That is, "A*B" might match everything with an A followed by a B or it might match the exact string "A*B" and only that string. I did this because looking around at various tools it looks like it might be hard to change the meaning of "*" and "?" for the text searches. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Apr 17 07:40:07 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Apr 2006 01:40:07 -0600 Subject: [DAS2] proposed April 17 agenda Message-ID: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> Gregg is taking the month off. I volunteered to be in charge of the next teleconference. Here is what I would like to talk about: 1. get additional agenda items 2. 
status reports 3. who maintains the list of reference names for different genomes (starting with the list Licoln developed)? 4. resolve some questions with the spec (see my previous email) 5. get a volunteer to come up with best-practices examples of how to represent various complex annotations 6. writeback planning Andrew dalke at dalkescientific.com From lstein at cshl.edu Mon Apr 17 13:46:23 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 17 Apr 2006 09:46:23 -0400 Subject: [DAS2] alignments In-Reply-To: <5dd5ce9d6d6e977e56c7b4e30e622f7c@dalkescientific.com> References: <5dd5ce9d6d6e977e56c7b4e30e622f7c@dalkescientific.com> Message-ID: <200604170946.24479.lstein@cshl.edu> I didn't realize there were multiple things called CIGAR. I think we should use Ensembl CIGAR format. The target of the alignment should be a segment, and not another feature. Best, Lincoln On Friday 14 April 2006 04:29, Andrew Dalke wrote: > I need a bit of help here. I'm trying to hand-write an example of a > feature based on an alignment. Let's assume these are annotations on > fly and it's aligned to human. There's a hit from > > fly chromosome 4 > http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4 > range 100:200 > > to human chromosome 8 > http://www.ensembl.org/Homo_sapiens/Chr1 > range 200:300 > > Assume the CIGAR string of the match is > 51 identical, 3 insertions, 24 identical, 3 deletions, 25 identical > > Here's the best I can manage: > > > > segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4" > range="100:200" cigar="?????"/> > > > > > First question: > Where do I put the object to which the alignment aligns? Will > it be a segment or a feature? Now, I could have this completely wrong > and DAS2 is not meant for genome/genome alignments like this. If > that's the case please offer an example of how to write an alignment. > > > Second question: > What's the format of the CIGAR string? Lincoln's text pointed to > http://www.ebi.ac.uk/~guy/exonerate/exonerate.man.1.html > > That documentation says: > > The format starts with the same 9 fields as sugar output (see above), > > and is followed by a series of pairs where > > operation is one of match, insert or delete, and the length describes > > the number of times this operation is repeated. > > However, it does not list the operation characters nor if there are > spaces > between the fields. I assume it is "M 51 I 3 M 24 D 3 25 I", though > perhaps > without spaces. > > The GFF3 documentation at http://song.sourceforge.net/gff3.shtml refers > to > http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl > but I can find no relevant documentation there. > > I then found a comment by Richard Durbin from two years ago, at > > http://portal.open-bio.org/pipermail/bioperl-l/2003-February/ > 011234.html > > > 3) I'm not convinced by the format for the Align string. This requires > > a character per aligned base. There are a variety of run-length type > > encodings in common use that are much more compact. e.g. Ensembl uses > > a > > string such as "60M1D8M3I15M" to mean "60 match, then 1 delete, then 8 > > match, then 3 insert, then 15 match". They call this CIGAR, but when I > > talked to Guy Slater, who invented CIGAR for exonerate, his version is > > subtly different: "M 60 D 1 M 8 I 3 M 15" for the same string (see > > http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html). > > Jim Kent also has something like this. 
I'd prefer us to standardise on > > one of these formats, all of which are very short for ungapped matches. > > Which is the CIGAR string format DAS2 supports? Where is the > documentation for it? > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From dalke at dalkescientific.com Mon Apr 17 16:19:47 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 17 Apr 2006 10:19:47 -0600 Subject: [DAS2] question regarding most up-to-date D. melanogaster DAS? In-Reply-To: <83722dde0604141705t369cd016u30f1ca2ea7622d6c@mail.gmail.com> References: <83722dde0604141705t369cd016u30f1ca2ea7622d6c@mail.gmail.com> Message-ID: <3ecdfacc003d58cc93045bc7a4aefb57@dalkescientific.com> Ann: > Our markers (for influential loci in the study) are mapped to > cytological bands. Is it possible to run region-based queries using > cytological coordinates? (e.g., 30A - 30B, inclusive) My goal is to > find all candidate genes under those peaks. At present there is no way to do that. A server can extend the query syntax to support searches in cytological coordinates and add new feature elements to store those coordinates. I don't know enough about how people use those coordinates to sketch an example. Andrew dalke at dalkescientific.com From aloraine at gmail.com Mon Apr 17 17:47:03 2006 From: aloraine at gmail.com (Ann Loraine) Date: Mon, 17 Apr 2006 12:47:03 -0500 Subject: [DAS2] question regarding most up-to-date D. melanogaster DAS? In-Reply-To: <3ecdfacc003d58cc93045bc7a4aefb57@dalkescientific.com> References: <83722dde0604141705t369cd016u30f1ca2ea7622d6c@mail.gmail.com> <3ecdfacc003d58cc93045bc7a4aefb57@dalkescientific.com> Message-ID: <83722dde0604171047r26a32986gaa4c3b34b6166c16@mail.gmail.com> I'm not sure it would be worth adding more work to the project to allow for these cases. If funding is renewed, then I think it would be worth the effort. But for now, probably not, since it would be a new feature. (At this stage, avoiding feature creep seems advisable :-) I believe I can get a mapping of cytological bands onto genomic coordinates from FlyBase. I don't know how reliable these mappings are, but assuming they are okay, I can use them to query a fly DAS site to get the genes in those coordinates. I'm not sure what is the best DAS site to use for this, however. -Ann On 4/17/06, Andrew Dalke wrote: > Ann: > > Our markers (for influential loci in the study) are mapped to > > cytological bands. Is it possible to run region-based queries using > > cytological coordinates? (e.g., 30A - 30B, inclusive) My goal is to > > find all candidate genes under those peaks. > > At present there is no way to do that. > > A server can extend the query syntax to support searches in > cytological coordinates and add new feature elements to store > those coordinates. I don't know enough about how people use > those coordinates to sketch an example. 
> > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 > -- Ann Loraine Assistant Professor Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From dalke at dalkescientific.com Tue Apr 18 07:36:39 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 18 Apr 2006 01:36:39 -0600 Subject: [DAS2] proposed April 17 agenda In-Reply-To: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> References: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> Message-ID: Summary of today's conference call. > 2. status reports The biggest one is that the new version of IGB is out and the Affy DAS server is available at http://netaffxdas.affymetrix.com/das2/sequence Steve and Ed (as I recall) tracked down a problem with that server which might affect other implementations. The problem is knowing the public/external URL for the DAS service. In theory it can be determined by looking at various CGI headers, but with things like an Apache rewrite and forwards to the actual server it can get complicated. The solution seems to be either use relative links or have a configuration option in the server specifying the base name. Lincoln's been working on reference names. Allen's been working on how the writeback server might work. I've been working on the spec, and have not gone further with the validator. > 3. who maintains the list of reference names for different > genomes (starting with the list Licoln developed)? Lincoln proposed, to broad acceptance, that we set up a wiki page with the reference names. The easiest way is to use the OBF wiki, at http://open-bio.org/wiki/Main_Page because that is already set up. I can ask the OBF about the appropriateness of that - I think it's fine. > 4. resolve some questions with the spec (see my previous email) Here are the resolutions: 1) type ontology URI I've emailed Suzi asking about plans for GO, the Gene Ontology Consortium, whoever in coming up with standardized, public ontology URLs. Allen's cc'ed on it, and we'll discuss this off the DAS list. 2) Feature strand. I stand corrected. The definitions are 1 for positive -1 for negative 0 both strands not don't know or does not have meaning 3) taxid There seems to be no reason to keep the 'taxid' in the SOURCE element. We'll only have it in the COORDINATES element. 4) 'writeable' We'll defer this (leaving it as-is) until we have the writeback defined a bit better. 5) content-type for FASTA records We'll recommend "text/x-fasta" or "text/plain" as the content-type for FASTA responses. There is no widely accepted community standard. 6) response document too large There is no automatic way for a client to narrow its request. This must be done by a person, depending on what the search criteria are. Servers should support large requests so that this isn't a problem. 7) styles We'll shift to using a stylesheet. This will be listed in the versioned source record as As a rough sketch the document will look like The STYLE elements add a new "uri" attribute which is the URI of the feature type being styled. In theory this could also include the feature uri (to define the style for a single feature) or an ontology uri (sets the style for all features with that ontology term or its descendants). However, with that comes problems of precedence. 
If the feature type and the feature and the ontology each have styles, which one wins? I think feature beats type beats ontology. But I also think we can ignore this because no one has asked for this sort of flexibility. (More flexibility would be support for a query language selecting which features, types, sources, ontologies, feature alias, etc. should get a given style. Not going there. :) 8) the "count" format This should be the number of feature elements returned, and not the number of "annotations" (counting the multiple features of a complex annotation as 1) 9) alignments Lincoln will provide examples. 10) CIGAR string We'll use the EBI style CIGAR strings, and the documentation will be based on the GFF3 description at http://song.sourceforge.net/gff3.shtml 10.5) Do we need a REGION element? No. Deleted from the spec. 11) XID On Ed's recommendation I'm looking at MAGE XML. I am not a good UML reader so it's slow going. My view so far is that what I sketched out is on the right track and we can simplify things compared to MAGE, eg, we don't need full bibliographic records. The other idea is to defer finalizing this until people start providing data with XIDs, so we know what's needed. 12) complex features Lincoln will come up with some examples. 13) "root" attribute There are two changes here: - complex annotations must have a single root feature - all features which are in complex annotations must have a link to the root element There's some worry about the first requirement, in that some complex annotations may not have a "real" root. I argue that having a synthetic one is okay. There were no strong arguments against having a single root. We decided to defer finalizing this until we have some example of complex annotations. 14) features have a 'STYLE' element no, they don't. 15) "*" and "?" in the query string The proposal here is to say that the interpretation of "*" other than at the start and/or end of the query string is implementation defined, as is the use of "?". It used to be that any other use of "*" must be treated as an asterisks, so "***" finds all strings containing a "*". It looks like people are fine with this looseness. > 5. get a volunteer to come up with best-practices examples > of how to represent various complex annotations That's Lincoln. > 6. writeback planning Allen will take the implementation lead on this, funding willing. He's currently working on how to associate an identifier with a new feature. One thought is to progress in stages: - upload completely new features / complex annotations to the server - modify an existing feature, though not the parent/part relationship (eg, change the location) - delete a simple feature - delete a complex annotation - modify an existing complex annotation, or turn a simple feature into a complex annotation - do 'em all at once The work will need to be server driven as the current clients can't handle this before the end of the funding period. The clients will mostly be library code. Andrew dalke at dalkescientific.com From lstein at cshl.edu Mon Apr 24 12:35:21 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 24 Apr 2006 08:35:21 -0400 Subject: [DAS2] Not able to make it today In-Reply-To: References: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> Message-ID: <200604240835.21690.lstein@cshl.edu> Hi All, Due to wedding preparations I will be unable to attend the conference call today. I might or might not be able to make it next week (I'll be in Toronto) but I'll let you know in advance. 
Best, Lincoln -- Lincoln Stein lstein at cshl.edu Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From dalke at dalkescientific.com Mon Apr 24 16:11:31 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 24 Apr 2006 10:11:31 -0600 Subject: [DAS2] April 24 meeting - cancel? In-Reply-To: <200604240835.21690.lstein@cshl.edu> References: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> <200604240835.21690.lstein@cshl.edu> Message-ID: Hi all, I'm trying to come up with an agenda but I've done very little the last week DAS related. I've been working on selling my house. Looks like this will be a short meeting, or should we just cancel? Here's my status. - Sent mail to Suzi asking about URIs for ontologies. Heard nothing from her yet. - Talked with the OBF people about setting up a wiki for the reference names for the genomes/segments. We decided to use the OBF wiki for now and if there are enough pages we'll migrate over to a biodas-specific wiki. I'm about 1/2-way through, learning wiki syntax. I'll email when it's there. - I've migrated the spec 300 doc into CVS. Just checked it in. There's still some formatting issues though. - started working on the stylesheet spec. Should take another 3 hours or so. - haven't been able to log into cgi.biodas.org to restart the validation server. - still need to write an rnc for the writeback for Allen Andrew dalke at dalkescientific.com From allenday at ucla.edu Mon Apr 24 16:29:09 2006 From: allenday at ucla.edu (Allen Day) Date: Mon, 24 Apr 2006 09:29:09 -0700 Subject: [DAS2] April 24 meeting - cancel? In-Reply-To: References: <4fb9a13f4a18a6e1275256affbb97a51@dalkescientific.com> <200604240835.21690.lstein@cshl.edu> Message-ID: <5c24dcc30604240929l7a882dd9qa15c0a51bd636cb0@mail.gmail.com> Let's cancel it. I have a database set up for writeback, and am able to POST delta XML to the server. I am still at the stage where I am parsing the XML. The DTD would be helpful. See attached figure "writeback.png" for the current implementation track. I am at the "Parse XML" step in implementation. See attached "vsourcecommand.png" for an overview of the previous writeback plans as documented in the HTML docs, and "vsourcelock.png" for an overview of lock plans as documented in the HTML docs. Parts of these may at some point be helpful for folding into the current implementation. I can send or commit to CVS the source documents for any of these diagrams if people would like to edit. -Allen On 4/24/06, Andrew Dalke wrote: > > Hi all, > > I'm trying to come up with an agenda but I've done very little > the last week DAS related. I've been working on selling my house. > Looks like this will be a short meeting, or should we just cancel? > > Here's my status. > > - Sent mail to Suzi asking about URIs for ontologies. Heard > nothing from her yet. > > - Talked with the OBF people about setting up a wiki for the > reference names for the genomes/segments. We decided to use the > OBF wiki for now and if there are enough pages we'll migrate over > to a biodas-specific wiki. I'm about 1/2-way through, learning > wiki syntax. I'll email when it's there. > > - I've migrated the spec 300 doc into CVS. Just checked it > in. There's still some formatting issues though. > > - started working on the stylesheet spec. 
Should take another > 3 hours or so. > > - haven't been able to log into cgi.biodas.org to restart the > validation server. > > - still need to write an rnc for the writeback for Allen > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 > -------------- next part -------------- A non-text attachment was scrubbed... Name: writeback.png Type: image/png Size: 41093 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: vsourcelock.png Type: image/png Size: 91466 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: vsourcecommand.png Type: image/png Size: 49552 bytes Desc: not available URL: From dalke at dalkescientific.com Mon Apr 24 17:39:29 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 24 Apr 2006 11:39:29 -0600 Subject: [DAS2] sequence names on wiki Message-ID: <6e4986bba9736f1c43f239646b8a22d4@dalkescientific.com> I've imported Lincoln's list of global sequence identifiers onto the open-bio wiki at http://open-bio.org/wiki/DAS:GlobalSeqIDs Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Apr 27 07:33:29 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 27 Apr 2006 01:33:29 -0600 Subject: [DAS2] writeback spec Message-ID: I've written up a draft of the writeback spec. It's in CVS. das/das2/das2_writeback.html with the RNC in das/das2/writeback.rnc -- for the writeback document das/das2/mapping.rnc -- for the mapping from old URLs to new On the question of how to handle new records, which need new identifiers, I decided to go with the private identifier scheme. The client uses "das-private:0000" where the "0000" is alphanumeric and 1 up to 20 characters long. The server responds with a mapping document which looks like I decided on this instead of the "preallocate identifier" scheme because this requires less state on the server (it doesn't need to remember which identifiers were already issued) and because it supports versioning servers better. Is the web site being updated from CVS? I see it hasn't gotten the updates I made on Monday. Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Thu Apr 27 17:34:12 2006 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Thu, 27 Apr 2006 10:34:12 -0700 Subject: [DAS2] writeback spec In-Reply-To: Message-ID: Andrew, > From: Andrew Dalke > Date: Thu, 27 Apr 2006 01:33:29 -0600 > To: DAS/2 > Subject: [DAS2] writeback spec > > I've written up a draft of the writeback spec. It's in CVS. Great. Thanks. > > Is the web site being updated from CVS? I see it hasn't gotten > the updates I made on Monday. You mean in some automated fashion? Before we switched to generating the html from templates, I set up a cron that updated the manually edited html file for the read spec on biodas.org. I don't know if there is an automated process that produces the template-based html from CVS on biodas.org -- unless you or Lincoln set something up. BTW, I can't ssh into portal.open-bio.org, or even ping it. This is (or perhaps was) the machine hosting biodas.org. Do you the story here? 
Steve From dalke at dalkescientific.com Thu Apr 27 17:55:55 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 27 Apr 2006 11:55:55 -0600 Subject: [DAS2] writeback spec In-Reply-To: References: Message-ID: Steve: > You mean in some automated fashion? Before we switched to generating > the > html from templates, I set up a cron that updated the manually edited > html > file for the read spec on biodas.org. I don't know if there is an > automated > process that produces the template-based html from CVS on biodas.org -- > unless you or Lincoln set something up. I didn't set anything up. One thing to note though is that I'm not using the template system for the current specs. The validator I have now is much more powerful than the one then so I'm parsing the spec documents and validating them. "More powerful" includes that I can report the error line as it is in the spec document and not just in the piece of XML to validate. It should be possible to just pull the specs out of CVS. > BTW, I can't ssh into portal.open-bio.org, or even ping it. This is (or > perhaps was) the machine hosting biodas.org. Do you the story here? Chris Dag. sent out an email on 3/23 "Important news for all developers ith open-bio.org CVS access (2) All of our websites have been consolidated on the new server newportal.open-bio.org Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Thu Apr 27 18:09:09 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Thu, 27 Apr 2006 11:09:09 -0700 Subject: [DAS2] writeback spec In-Reply-To: Message-ID: Andrew: > I didn't set anything up. One thing to note though is that I'm not > using > the template system for the current specs. The validator I have now is > much more powerful than the one then so I'm parsing the spec documents > and validating them. "More powerful" includes that I can report the > error line as it is in the spec document and not just in the piece > of XML to validate. > > It should be possible to just pull the specs out of CVS. Cool. I can look into updating my cronjob to grab the new specs. > Steve: >> BTW, I can't ssh into portal.open-bio.org, or even ping it. This is (or >> perhaps was) the machine hosting biodas.org. Do you the story here? > > Chris Dag. sent out an email on 3/23 "Important news for all developers > ith open-bio.org CVS access > > (2) All of our websites have been consolidated on the new server > newportal.open-bio.org Yep. Just realized that. At the moment, I can't access my account on this new server. Probably my password got reset. I've got a support request in. Steve From Steve_Chervitz at affymetrix.com Thu Apr 27 19:16:23 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Thu, 27 Apr 2006 12:16:23 -0700 Subject: [DAS2] writeback spec In-Reply-To: Message-ID: OK, Andrew's writeback spec is now accessible at: http://www.biodas.org/documents/das2/das2_writeback.html Be sure to refresh your browsers to get the latest spec at http://biodas.org/documents/das2/das2_protocol.html I re-established my cronjob to update all the documents in this das2 directory twice daily (00:01 and 12:01 East coast time). This das2 directory is a new cvs checkout. I moved the previous das2 directory to das2.old, in case it contains anything we might need that isn't in CVS (accessible via http://www.biodas.org/documents/das2.old/ ). Steve > From: Andrew Dalke > Date: Thu, 27 Apr 2006 01:33:29 -0600 > To: DAS/2 > Subject: [DAS2] writeback spec > > I've written up a draft of the writeback spec. 
It's in CVS. > > das/das2/das2_writeback.html > with the RNC in > das/das2/writeback.rnc -- for the writeback document > das/das2/mapping.rnc -- for the mapping from old URLs to new > > On the question of how to handle new records, which need > new identifiers, I decided to go with the private identifier > scheme. The client uses "das-private:0000" where the "0000" > is alphanumeric and 1 up to 20 characters long. The server > responds with a mapping document which looks like > > > to="http://blah.com/das2/whatever/feature/123" /> > > > I decided on this instead of the "preallocate identifier" > scheme because this requires less state on the server > (it doesn't need to remember which identifiers were already > issued) and because it supports versioning servers better. > > > Is the web site being updated from CVS? I see it hasn't gotten > the updates I made on Monday. > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2
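(As a rough illustration of the mapping response described in the quoted message above: only the to="..." attribute appears in the message itself, so the MAPPINGS and MAPPING element names and the "from" attribute are assumed here, not copied from mapping.rnc:

    <MAPPINGS>
      <MAPPING from="das-private:0000"
               to="http://blah.com/das2/whatever/feature/123" />
    </MAPPINGS>

Each new feature submitted with a "das-private:" identifier would get one such entry giving the permanent URI assigned by the server.)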