[DAS2] Notes from the DAS/2 teleconference for the code sprint, 10 Feb 2006

Fri Feb 10 22:10:28 UTC 2006

$Id: das2-teleconf-2006-02-10.txt,v 1.1 2006/02/10 22:13:17 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  CSHL: Lincoln Stein
  Sanger: Thomas Down, Andreas Prlic
  Sweden: Andrew Dalke
  UCLA: Allen Day

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 

[note taker missed the first 5 minutes]

Topic: Properties
-----------------

gh: Properties are all tag-value
ad: yes
gh: don't think we need your binary thing.
ad: ok drop it
gh: href is needed. can always point it to a binary something out there.
can the value just be a url?
ad: can make it relative to xml base
gh: do you need some property with tag value and href at same time?
ls: how would you interpret that? should be either value or href.
ad: there's nothing to say how to interpret the url.
gh: nice to have multiple links out to somewhere else and to have some
indication what they are w/out traversing the link. e.g., this is the
genbank ref, ensembl ref, protein, etc.
if xid had an extra field with label, title e.g. that would suffice.
ad: sounds ok

[A] xids will have title + href, properties will have tag + value
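
A minimal sketch of how a client might read the two shapes agreed on
above: properties as tag + value, xids as title + href, with a relative
href resolved against xml:base as ad suggests. The element and attribute
names (FEATURE, PROP, XID) and the example.org URLs are placeholders
invented for illustration; the notes do not fix the actual markup.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# The xml: prefix is predefined, so ElementTree reports xml:base under this name.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical fragment: element and attribute names are illustrative only.
SAMPLE = """
<FEATURE id="feat1" xml:base="http://example.org/das2/genome/">
  <PROP tag="score" value="0.97"/>
  <XID title="GenBank record" href="xid/genbank/ABC123"/>
  <XID title="Ensembl gene" href="http://example.org/ensembl/GENE1"/>
</FEATURE>
"""

def read_feature(xml_text):
    """Return (properties dict, list of (title, absolute href)) for one feature."""
    feat = ET.fromstring(xml_text.strip())
    base = feat.get(XML_BASE, "")
    # Properties are plain tag/value pairs.
    props = {p.get("tag"): p.get("value") for p in feat.findall("PROP")}
    # Each xid carries a human-readable title plus a link, so a client can
    # label the link without traversing it.
    xids = [(x.get("title"), urljoin(base, x.get("href")))
            for x in feat.findall("XID")]
    return props, xids

props, xids = read_feature(SAMPLE)
```

The title is what lets a client show "GenBank record" or "Ensembl gene"
next to a link without fetching it, which is the use gh describes.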

Topic: Exercising the spec
---------------------------

gh: we need the reference server to actually exercise this part of the
spec. xid. possibly other things like: target overlap, inside, cigar
strings. encoding, decoding.
aday: oh no. 
ls: line element. cigar string is something that no one has tested yet.
gh: if we don't have server doing it by next code sprint
aday: any impls out there we could use?
gh: bioperl has a gff3 parser.
aday: I wrote it, and I didn't impl cigar string parsing.
ls: there's a cigar processor in bioperl AlignIO. in theory not hard
to do. 
gh: lbl folks (Nomi et al) have a java one, too. I think.
gh: other parts of spec that aren't getting exercised? I doubt if
anyone has used xml lang.
ad: added xml id. just there for other reasons, but not what we need
it for.
gh: we talked about all ids being xml ids and combining xml id and xml
base, can't remember why we stopped discussing.
ad: don't think we need to. style sheet has uses for this maybe.
ad: has anyone generated doc href yet?
td: can add this stuff easily now.
gh: for testing purposes, just throw a doc href everywhere it's
allowed.
ad: are servers supporting retrieval of seq data?
aday: yes
ad: support for alt feature formats?
aday: can do old compact formats, not sure about coverage.
gh: yes, alt feat formats are handled, but server isn't up and running
yet. igb das/2 client can handle it already.
ad: retrieval of assembly?
aday: no assembly data
ad: i don't touch assembly
gh: may be for next code sprint.
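
On the cigar string encoding and decoding that gh and ls mention as
untested: a rough sketch of a round trip, to show how small the job is.
It assumes the GFF3 Gap attribute flavour (operation letter followed by a
count, space separated, e.g. "M8 D3 M6"); the notes do not say which
flavour DAS/2 will use, and the function names here are made up.

```python
import re

# One token is an operation letter followed by a positive count, e.g. "M8".
# M, I, D, F, R are the operation letters used by the GFF3 Gap attribute.
_TOKEN = re.compile(r"([MIDFR])(\d+)")

def parse_gap(gap):
    """Decode 'M8 D3 M6' into [('M', 8), ('D', 3), ('M', 6)]."""
    ops = []
    for token in gap.split():
        m = _TOKEN.fullmatch(token)
        if m is None:
            raise ValueError(f"unrecognised cigar token: {token!r}")
        ops.append((m.group(1), int(m.group(2))))
    return ops

def encode_gap(ops):
    """Encode a list of (operation, count) pairs back into a gap string."""
    return " ".join(f"{op}{count}" for op, count in ops)
```

A reference server test could feed each served string through
parse_gap and encode_gap and check that it comes back unchanged.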

Topic: range based query
------------------------

gh: thomas and i don't like optional mins and maxes.
ls: fine as long as you can always determine the size of the
reference. provide beginning and end.
gh: exception: if you want the whole sequence, can you just not supply
range?
ad: yes
gh: :1 and :-1 how to interpret nothing for strand on end and 0 for
strand at end?
ls: features that have strand +1, -1; features that have no strand or
are on both strands (0); features that may have a strand but you don't
know (empty)
gh: when you put it in the query there's a difference between i don't
know and i will accept anything.
use case: transfrags from transcriptome project. unknown strand, but I
know it *is* one or the other strand.
ls: how about this arrangement:
 empty = i don't care
    0  = has strand but i dont know
    1  = forward strand
   -1  = reverse strand
    2  = both strands
ad: could be organized by track (everything in a track has same strand).
gh: don't think it's good to structure a query so it's required that you
do have strand. you could have diff strand designations on the same
track. 
ls: you want to be able to distinguish things that are on both
strands, things that are on either strand, but you don't know which.
gh: biggest concern: given a range based query to server
1000-2000 means everything that overlaps, any strandedness within this
range.
ad: should support stranded searches. client can filter out
opposed to do a strand request against seq to get the rev comp. client
should be able to do this.
gh: in range attrib of features, you can add colon to indicate
strandedness.
ad: yes
gh: if no :strand does this mean unknown or don't care?
ls: defaults to *, anything. you get fwd, rev, don't know, don't care.
gh: require things on fwd strand to be :1, not make it a default.
ad: ok. if not there, means ambiguous, unknown, or not
appropriate. see email i sent.
if you get rid of search for strand in region query, most of this
issue goes away.
gh: don't think people would use this often (stranded query)
ad: you can make two queries to server instead of one.
gh: this is a resolution for all range-related issues.
ad: check my email to make sure it covers this.

[A] everyone review andrew's email re: range queries and strand issues.
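
A sketch of where the discussion above seemed to be heading, pending
andrew's email: a range may carry an explicit :1 or :-1, a missing strand
means ambiguous, unknown, or not appropriate, and a region query matches
anything that overlaps regardless of strand. The "min:max:strand" string
layout, the inclusive coordinates, and all names below are assumptions
for illustration, not spec text.

```python
from typing import NamedTuple, Optional

class Range(NamedTuple):
    start: int
    end: int
    strand: Optional[int]  # 1, -1, or None when no ":strand" suffix is given

def parse_range(text):
    """Parse 'min:max' or 'min:max:strand' (layout assumed, see lead-in)."""
    parts = text.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected min:max[:strand], got {text!r}")
    start, end = int(parts[0]), int(parts[1])
    strand = None  # no suffix: ambiguous, unknown, or not appropriate
    if len(parts) == 3:
        strand = int(parts[2])
        if strand not in (1, -1):
            raise ValueError(f"strand must be 1 or -1, got {parts[2]!r}")
    return Range(start, end, strand)

def overlaps(feature, query):
    """Strand-blind overlap test on inclusive coordinates."""
    return feature.start <= query.end and query.start <= feature.end
```

With strand left out of the region search, a client that wants one
strand only filters the returned features on their strand value itself.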

gh: also or-ing of diff range-based queries is not useful for me.
I mainly need intersects of overlaps and inside. or-ing is equivalent
to using multiple queries.
td: why do you need and-ing of overlaps and inside?
gh: optimization on client side. keeps track of what it has
received. wants to minimize re-fetching.
td: can you just use overlap and not overlap?
gh: that may be equivalent, but the way I do it, you can guarantee you
never get the same feat twice with that combo. will require and-ing of
two range-based queries.
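
One way to picture gh's point, as a toy simulation rather than IGB's
actual logic: after fetching everything that overlaps a first region, the
client extends its view and asks only for features that overlap the new
region and also lie inside the part of the sequence it has not asked
about yet. The coordinates, the inclusive ranges, and the SEQ_END value
are made up for the example.

```python
def overlaps(feat, lo, hi):
    """True if the feature (start, end) shares any base with [lo, hi]."""
    return feat[0] <= hi and lo <= feat[1]

def inside(feat, lo, hi):
    """True if the feature (start, end) lies entirely within [lo, hi]."""
    return lo <= feat[0] and feat[1] <= hi

# Toy server contents: (start, end) pairs on one sequence.
FEATURES = [(900, 1100), (1500, 1600), (1900, 2100), (2200, 2300), (2900, 3100)]
SEQ_END = 10_000  # assumed length of the reference sequence

# First request: everything overlapping 1000-2000.
first = [f for f in FEATURES if overlaps(f, 1000, 2000)]

# View grows to 1000-3000. Second request and-s two range filters:
# overlap the new stretch, and start no earlier than where it begins.
second = [f for f in FEATURES
          if overlaps(f, 2001, 3000) and inside(f, 2001, SEQ_END)]
```

Here (1900, 2100) overlaps both stretches but is returned only by the
first request, so the two result sets never share a feature.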

ad: modifying query lang, or-ing together two. include first range and
include second range should use multiple query keys because of the
comma. you will have to escape any comma if it's inside of query
string. 
gh: don't like the implicit 'and' if different but 'or' if keys the
same. it depends on the query.
ad: now all queries are and-ed, but commas mean multiple.
ls: comma syntax seems natural. the occasional query that had to have
an escaped comma didn't cause any bother.
td: this was as it is in das/1. exons and repeat. type=exon,
type=repeat. so the suggestion is to use the das/1 behavior.
ad: three independent segments
gh: types as well. can have any number of types= and segment= all
or-ed together. I still need anding of overlaps and inside.
td: different keys are and-ed, same keys are or-ed.
ls: hoisted by my own petard here. works for me.
gh: allen?
aday: what's changed?
ls: the whole query language has changed in a fundamental way.
aday: dealing with multiple attributes with same name. fine.
gh: will server accept full urls for types?
aday: not now but will impl this.
gh: all types should be full uri's now. my client can't deal but will
soon.
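
A toy model of the combination rule as gh describes it above (any number
of type= and segment= values or-ed together, while different filters such
as overlaps and inside are and-ed). Features are plain dicts and matching
is simple string equality; both are simplifications for illustration, and
the key names in the usage note are only examples.

```python
from urllib.parse import parse_qs

def matches(feature, query_string):
    """Check a feature dict against a query string with repeated keys."""
    # parse_qs groups repeated keys: 'type=exon&type=repeat' gives
    # {'type': ['exon', 'repeat']}.
    filters = parse_qs(query_string)
    return all(                                       # different keys: AND
        any(feature.get(key) == value for value in values)  # same key: OR
        for key, values in filters.items()
    )
```

For the query "type=exon&type=repeat&segment=chr1", a feature of type
exon on chr1 matches, while a repeat on chr2 or a gene on chr1 does not.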

Topic: status reports
---------------------
gh: state what you hoped to accomplish and what you actually
accomplished. 

gh: hoped to get igb das client up to date with spec, working with one
das2 server, and get affy das2 server up and going.
affy das2 server will take longer. maybe by next code sprint.
igb is now using latest das2 spec, calling allen's server, and using
registry as well. happy with results. not everything done, but some
unexpected things (registry).
wrote up progress report for grant: going out 3pm today (we got
another day) a 2pg summary. will send out to everyone later.
todo: get das2 server up. client: deal with full uri issue. this is a
basic functionality of the client. smart handling of uris.

ee: igb client. big thing is to make it treat all data sources so they
all behave in a similar way: das1/das2, quick load, separate files,
regardless of the data format. want to make it all seamless. going well.

sc: streamlined pipeline for populating das server with affy exon array
data. didn't get to pipeline for external data (UCSC tracks), but have
basic framework in place.

ad: decided to do more writeback at next sprint. when is next sprint?
gh: march 13-17. lincoln will be in UK and can participate from there.
ad: I'm in the states next week. will come to emeryville for next
sprint.

[A] next code sprint is 13-17 March. Mark your calendars.

ad: hoped to work on spec, resolve detailed questions, make sure it works
with people's needs. will work on incorporating latest ideas into spec.
validator: have one but is not fit for public consumption. not at
where it was last summer on the previous version of spec.

ap: das interface for registry, can serve das1 and das2 sources w/ new
source command. java client - not yet. registry: todo UI so users can
upload to das registry.

td: hoping to write server. got something up for feat, types,
segments, need to run through andrew's validator. hope to work on
writeback, but didn't happen (but good discussion on it). want to get
more data included, ensembl database.
roy has been working on zmap client, coming along fine.

aday: primary goals: to support new version of spec -- not fully done:
uri problem in query parsing. apache config integration is
done. installation and rpm for server - done for FC3 i386, available
in the next couple of days (brian o'connor). general documentation
improvement in code for server - not done.
Next step: post, put, delete, writeback framework (originally planned
this but may need to rethink),  impl transaction logs (maybe in
flux). adding more unit tests.
ad: writeback spec won't happen for at least 2 weeks. need to write up
what we've done on current spec first.

ls: will be available from 14th on. at ensembl meeting up to the 13th.
gh: allen come to emeryville?
aday: maybe.
gh: will have to explore how to fund hosting folks here for next
codesprint. 

gh: speaking for nomi - she had apollo working for parsing features
and displaying them. some issues with higher level integration into
apollo. making good progress.

gh: time to wrap it up. thanks for your hard work.
[applause]

[A] next teleconf will be on 20 Feb, 9:30 PST 5:30 UK (regular time)
we're skipping 13 feb (next monday) given all our time this week.