[DAS2] Notes from DAS/2 code sprint #3, day five, 18 Aug 2006

Steve Chervitz Steve_Chervitz at affymetrix.com
Fri Aug 18 19:15:33 UTC 2006

$Id: das2-teleconf-2006-08-18.txt,v 1.2 2006/08/18 19:14:11 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Dalke Scientific: Andrew Dalke
  UCLA: Allen Day, Brian O'Connor

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 

Topic: Spec concerns
ad: segments doc (not 'segment') top-level element is missing three
fields, one is uri (I added). second is reference (a collection
corresponding to a dataset). seemed less useful since it's already
mentioned in vsource document. I added id to schema, not spec
yet. last thing: missing a doc_href, for each segment ok, but we can't
say, here's doc for human.
gh: optional?
ad: yes.
gh: if optional doesn't change server impl.
uri for segments is specified in segment capability.
gh: my only objection is spec churn.

gh: question about writeback spec: what you're supposed to do if you
remove an exon from a txt, you are supposed to have a delete element
in post that deletes that id.
ad: yes
gh: if you just have that delete, does that force parent to remove
its child, or do you also have to have the parent in there?
ad: everything in that relation has to be sent.
gh: in that example, if you have a delete for that exon, you have to
return the rooted hierarchy as well with txt not having that part
ad: yes
gh: what if you create a curation with three exons in it, you then
decide to delete the middle exon. server gets post with same
annotation, but exon is missing and parent is not pointing to it as a
part. is that legal?
ad: nothing that says delete?
gh: no
ad: i think it should be illegal.
if you have three generations. grand parent and grand child with no
intermediate. also illegal.
gh: server will have to catch these things.
ad: easy. just check whether all ids involved are representing
something on the server, if so, you delete old, update new.

gh: allen, will your server catch this?
aday: if you modify something, it already has to check before it gets
deleted, i can just reject it. now I say, you modified it, here are
the things that are modified by your request.

gh: [drawing] 
d:a-----b-----c  ->  d:a----------c , b

read this as: transcript d has exons a,b,c

three exons attached to a txt, never indicated that anything was
deleted, I just re-wrote the feature as a--------c
gh: this should throw an error, since you didn't explicitly delete b.
aday: what's wrong with leaving b dangling?
ok to not mention the missing exon
ad: one is to keep it there, one is to delete automatically,
gh: if keep does it have pointer to parent?
that's enough to tell db it's not connected?
aday: yes, it becomes an orphan.
you should get back a message, "hey you affected all of these
features." so client can see what your modification affected.
you'll know from response what was affected by deletion you
gh: if you now submit a new transcript named e containing a and c:
ad: so annotation 'd' will come back as saying, "was deleted"

aday: my response tells you everything that needs to be updated.
you might see things that need to be cleaned up that weren't expected.
ad: python maxim: when in doubt, refuse temptation to guess.
you're guessing it makes sense to leave orphans around

ad: if it's ambiguous, should be not supported.
gh: from allen's side, it might be hard to catch and call error.
aday: no I can catch. i track all changes caused by client request. I
have to track all changes made, see if it was present in the submitted
document, if not, an error. just another level of tracking. can do.

gh: if this is what you wanted to do, client would submit, write b
(with no txt as parent), write d with txt as parent. and no delete
to get this
d:a-----b-------c   ->  a-----------c + b

gh: if you really want to get rid of children, you need to specify
both parent and child.
gh: approach on client. I do on client. curational model is that you
are never really editing locations or parent-child relationships,
you are just making successors, so I keep this version chain. not
deleting old ones on server (that is the plan though).
aday: every edit does a delete and create on server. that's very
can you keep track of it in memory.
gh: yes. user has to request writeback. any number of edits between
one and the next. once you've committed you can rollback on
aday: everything is pruned off in client?
gh: no you need redo.
aday: redo is not considered saved unless you save again.
gh: if you re-edit after a undo, you can't redo. no branching.
aday: just keep track of recent save point.
gh: todo: keep modification dates. so if there were no edits since the
last save then there's no need to write back to the db again.
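
A minimal sketch of the client-side history described above: a linear
version chain with undo/redo, no branching, and a remembered save
point so that an unchanged state is not written back. The class and
method names are hypothetical, and it compares positions in the chain
rather than the modification dates gh mentions.

```python
class EditHistory:
    """Linear undo/redo chain with a remembered save point (no branching)."""

    def __init__(self, initial):
        self._versions = [initial]  # version chain; index 0 is the loaded state
        self._current = 0           # index of the version currently shown
        self._saved = 0             # index of the last version written back

    @property
    def current(self):
        return self._versions[self._current]

    def edit(self, new_version):
        # Editing after an undo throws away the redo tail: no branching.
        if self._saved is not None and self._saved > self._current:
            self._saved = None      # the saved version was in the discarded tail
        del self._versions[self._current + 1:]
        self._versions.append(new_version)
        self._current += 1

    def undo(self):
        if self._current == 0:
            return False
        self._current -= 1
        return True

    def redo(self):
        if self._current >= len(self._versions) - 1:
            return False
        self._current += 1
        return True

    def mark_saved(self):
        # Call after a successful writeback.
        self._saved = self._current

    def needs_writeback(self):
        # Nothing to send if we are sitting on the last saved version.
        return self._saved != self._current
```

With this shape, an undo after a save marks the state as needing
writeback, and a fresh edit at that point makes redo unavailable.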

gh: if you want something deleted, you must explicitly do it.
if you want to delete it do this:
   * delete b
   * write d:a------------c

if you want to orphan it do this:
   * write b with no parent
   * write d:a------------c
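
The rule above (every feature in the affected relation must be either
rewritten or explicitly deleted) could be checked on the server along
these lines. This is only a sketch under assumptions: the dictionaries
stand in for the server's store and the posted document, and the
function names are hypothetical, not part of the DAS/2 spec or of
Allen's server.

```python
def feature_group(features, start_id):
    """Ids of every feature connected to start_id by parent/part links.

    `features` maps feature id -> set of part (child) ids.
    """
    parents = {}
    for fid, parts in features.items():
        for part in parts:
            parents.setdefault(part, set()).add(fid)
    seen, stack = set(), [start_id]
    while stack:
        fid = stack.pop()
        if fid in seen:
            continue
        seen.add(fid)
        stack.extend(features.get(fid, ()))  # walk down to parts
        stack.extend(parents.get(fid, ()))   # walk up to parents
    return seen


def unaccounted_features(server, writes, deletes):
    """Ids the post leaves ambiguous; an empty list means accept.

    `writes` maps each written feature id to its new set of parts and
    `deletes` is the set of ids named in delete elements.
    """
    touched = set()
    for fid in set(writes) | set(deletes):
        if fid in server:
            touched |= feature_group(server, fid)
    return sorted(touched - set(writes) - set(deletes))


# Transcript d with exons a, b, c, as in the drawing.
server = {"d": {"a", "b", "c"}, "a": set(), "b": set(), "c": set()}
rewrite = {"d": {"a", "c"}, "a": set(), "c": set()}

print(unaccounted_features(server, rewrite, set()))              # b is ambiguous
print(unaccounted_features(server, rewrite, {"b"}))              # explicit delete
print(unaccounted_features(server, {**rewrite, "b": set()}, set()))  # orphan b
```

A post that silently drops b is rejected because b is neither written
nor deleted; the two accepted forms correspond to the two recipes above.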

Topic: semantics of insides and overlaps as they relate to parent-child
gh: this is a continuation from yesterday's discussion we had offline.
bring up spec, feature filters. see part that says, "any part of a
complex feature that is one with parents... then all parts are
returned". that's wrong.
you do an insides query, you only get back things that are inside. two
exons in a txt, one is inside, one is not inside.
gh: if it has no location, it's never going to be returned by a range
ad: by type q
gh: if multiple locations on the feature. if one of those locations is
inside the range query it passes.
gh: not the same as
multiple locations -- aligns to multiple places in the genome.
top level parent of a feat hierarchy must have a location that passes
one of the location in the range query.
one of the locations has to pass the range filter. and it is at the
top level of the hierarchy.
aday: think of this: locations are cols in matrix, filters are
rows. in order for column to qualify, the entire row must be true.
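
One way to read the matrix picture, as a sketch: build one row per
filter and one column per location, and let a location qualify only
when every filter is true for it. Intervals are assumed here to be
half-open (start, end) pairs, and all names are hypothetical.

```python
def overlaps(query):
    """Filter factory: true when a location shares any base with query."""
    qs, qe = query
    return lambda loc: loc[0] < qe and qs < loc[1]


def inside(query):
    """Filter factory: true when a location lies entirely within query."""
    qs, qe = query
    return lambda loc: qs <= loc[0] and loc[1] <= qe


def filter_matrix(locations, filters):
    # rows are filters, columns are locations
    return [[f(loc) for loc in locations] for f in filters]


def qualifying_locations(locations, filters):
    matrix = filter_matrix(locations, filters)
    return [
        loc
        for col, loc in enumerate(locations)
        if all(row[col] for row in matrix)  # every filter true for this column
    ]


def feature_passes(locations, filters):
    # gh's rule: one of the locations has to pass the range filters
    return bool(qualifying_locations(locations, filters))


locs = [(100, 200), (5000, 5100)]
filters = [overlaps((150, 6000)), inside((0, 1000))]
print(filter_matrix(locs, filters))
print(qualifying_locations(locs, filters))
```

Here both locations overlap the first range but only the first one is
inside the second range, so only the first location qualifies.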

ad: different people may have modeled it differently. may get only
part of it back.
gh: if two servers model the same data differently you may get
different answers back. that's the way it goes.
ad: annotation contains features. returns all annotations that match
the query.
gh: don't add notion of some other object that is sort of a feature,
but is really a group of feats.
aday: i call it a feature group. range filters operate on the group.
gh: we don't need to have a special designation. it's just a feature
with no parents. what you're calling a feature group.
aday: all things under the parentless feature is the group.
ad: yes
aday: not identical to the root, it's the root plus all attached
gh: to clarify things in the spec, maybe call it annotation/feature
group, maybe ok.
ad: all things connected by a parent-part relationship. return the
entire feature group.
gh: change: root of the feature hierarchy matches (range filters) the
root of the group has to pass all the feature filters in the range
ad: you want the root to be guaranteed to have locations if any sub
feats have location. featureless roots.
aday: no way to retrieve based on location. weird. parent with no
gh: not weird. bounds of gene are fuzzy. they'll spell out bounds of
exon but not the gene
we can say the highest level with location.
we can say that if children has location, then parent has.
ad: put all children ranges in the root.
gh: ok. children should never have locations outside their parent.
ad: old conversation: is this single or multiple rooted. single is
easier to understand. but there is a use case for multiple locations.
now we say the single root must be union of all it contains.
gh: inclusive, not necessarily union.
ad: software check will be needed
gh: you don't want someone submitting exons that are outside bounds of
a transcript. dangerous to have children outside location of parent.
aday: true for bioperl
ad: for only root, or intermediate?
aday: every intermediate
gh: only acceptable if you want to punt on location of upper level
thing whose location isn't well understood (gene).
aday: feature 100-200, locationless thing attached to it..
gh: if you have locationless, they need to be locationless up to the
maybe we should not allow that for now.
if you have a locationless feature, it's locationless all the way down
and all the way up. meets requirement for gene das.

ad: don't understand why this restriction needs to be there.
ee: we want it.
gh: you cannot have children outside bounds of their parents and their
parents recursively. to me, that needs to happen. question: can you
have children with location that have parents that are locationless?
ad: why parents that don't overlap child location?
gh: throws off our range filter mechanism. no easy answers to
ad: if any children meet criteria, then they all get returned.
gh: then you get back features that don't meet
sc: lets say you're editing an exon...
gh: forget editing. just basic reading. there was ambiguity in old
spec here that I want to kill.
I've seen desire to have locationless thing above, but never the
reverse: definitive location above but locationless below.

gh: we hashed this out in last code sprint. let's complete it!
ad: if any feature matches, then all features match. includes the
situation if parent has no location, but child matches, that implicitly
my proposal was to return all things in feat group if any one of the
features match. same as assuming all parents have location of their
children. this search will get back the parent.
returning the feat group is a way to say all parents implicitly
include locations of their children.
aday: not all parents, multiple roots.
gh: they all must go to a single root.
aday: if any location of the root of group matches, then the whole
group matches. 
boils down to: are descendant feats allowed to be outside the
bound of parent.
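
The two selection rules on the table can be put side by side in a
short sketch. The Feature class, the half-open (start, end) intervals
and the function names are assumptions for illustration only; neither
rule is spec text.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    fid: str
    locations: list                      # (start, end) pairs, half-open
    parts: list = field(default_factory=list)


def overlaps(loc, query):
    return loc[0] < query[1] and query[0] < loc[1]


def walk(feature):
    """Yield the feature and everything below it in the hierarchy."""
    yield feature
    for part in feature.parts:
        yield from walk(part)


def root_rule(root, query):
    # aday/gh: only the root's own locations are tested
    return any(overlaps(loc, query) for loc in root.locations)


def any_feature_rule(root, query):
    # ad: a match on any feature in the group selects the group
    return any(
        overlaps(loc, query) for f in walk(root) for loc in f.locations
    )


def select_groups(roots, query, rule):
    # either way, a selected group comes back whole
    return [[f.fid for f in walk(root)] for root in roots if rule(root, query)]


# A locationless gene above two located exons.
gene = Feature("g", [], [Feature("e1", [(100, 200)]), Feature("e2", [(800, 900)])])
print(select_groups([gene], (150, 160), root_rule))
print(select_groups([gene], (150, 160), any_feature_rule))
```

The rules only diverge when a descendant has a location the root does
not cover, which is the case the locationless gene illustrates.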

gh: [insides query example on board]

aday: the query is on the feature group root features
ad: I don't remember range queries being allowed only on root
two exons that are very far apart. query hits in between them.
gh: parent meets overlap, return them all.
ad: parent has only two small locations, not one large location.
gh: modeled as multiple small locations, not child features.
sc: so it doesn't include the intervening sequence.
gh: canonical example of mult location stuff: 25mer probe that hits 4
diff locations in genome.
multiple alignments, where none of the alignments align to the
whole thing.

aday: two probe pair, only some of the children are in the region.
ad: example: protein structure catalytic group, three residues on
different chains. 

gh: mult locations of probe set, one location falls inside query,
return the probe set
why can the rule be

ad: besides range searches: when you find that a feature matches title
or curator name, do you return back just the matching feats or the
gh: don't see why we can't add more rules.
aday: name search and exon is named, return its parents.
ad: so for any searches besides ranges, it returns all features in the
feature group.
gh: different behavior for range queries.
they already have different behavior than other queries.

ad: my criteria, if any feature matches, then all features in group
are returned, except that in range query, only those that match the
range query are returned.
gh: don't see why you have a problem with that requirement.
ad: do the search on all features, root is not special, if any feat
match, get all features in group, if a range filter, then get features
that pass. if a filter, then full hierarchies are not returned, only
those that pass filter.
gh: don't like. do an overlaps, two exon are in, two are not. you send
back only the txt and the two that are, you are depriving user of
data, there's no way of knowing that it's missing, how can they get at
ad: i'm confused. in system you want, you return back everything?
gh: yes. everything that has a root with one location that matches all
range filters. if the root of the feat group meets range criteria for
at least one of its locations.
aday: and any name filter
ad: root has no location info, but one of exons overlap, whole thing
ee: distinction between olap and includes, different if parent lacks
location info.
aday: gregg needs for range optimizations. name may match, but feat
location may not, but root of group may
ad: specified in root node. not convinced we need locationless
features that aren't descended.

gh: we're not talking about locationless nodes now.
parent has location, that's all you need to search on.
ad: use pieces, or whole range?
gh: the whole range, not piece by piece.
ad: why 
aday: there can be things

gh: I argued against having mult locations, caused problems in
bioperl, children with locations, and mult locatable features. so I
didn't want to have mult locations, but got voted down.
only time it makes sense: when you want one feature to represent an
alignment to multiple things on the genome. OK to represent with mult
locs, but better to not.

aday: offsets relative to the root.
gh: no. will confuse people a lot.
ad: any annotations that will go on mult segments in dna world.
aday: blast results, very common.
gh: every blast hit is a separate feature, avoids the problem.
I use them in transforms, so I can say this feature maps to different
genome assemblies.
fine in a data model. but causes problems when it's in a spec, hard to
describe when you should use one vs the other.
aday: what rules do you use internally?
gh: i know it when i see it.
ee: in genometry, these are equivalent regions on these genomes.
gh: right. the length of the range is the same
length can be identical, but seq is different.
genometry doesn't care about sequence identity.
"this part of hg17 is equivalent to hg18".
but this is getting tangential.

ad: question is what do you do for things that are mult segments.
example where parent is wider than children
aday: you don't know where 3' end is
gh: haplotype block for a set of snps, you know it extends to the next
block, so the block is bigger than the bounds of the snps used to
construct it. 
ad: curation tool, marked off three regions, one thing can extend over
a broader range. tool automatically inserts. allows curator to stretch
it out as need be.
sc: this is what fuzzy locations are used for at genbank.
gh: we don't have fuzzy locs. no needs for these at present.

ad: implicitly the parent is the min-max of its children. a db could
optimize that way.
curation tool gets data back from server. does curation tool know to
change the parent range or not?
gh: it better
ad: if user changes the min/max exon bounds, will tool know to adjust
parent transcript? the txt could be left extending past the current
location of these.
gh: up to the client app to figure it out. a smart gui should say, you
cannot extend the txt past the exons you have, but for a genotype
block, it might allow such a change. in theory, your client would
understand for which elements in the sequence ontology you could do it
and for which you could not.

ee: this is outside the spec. should say it's possible for parent to
extend beyond bounds of children, and not possible for children to be
outside of parent.
ad: which of these can be on multiple segments?
gh: if we're going to have mult locs, then everything can.
ee: if child can, then parent can.
aday: an argument for doing relative offsets I suggested. only allow
parents to have relative offsets to children. no duplication of data.
gh: duplication of data is a red herring.
ad: more error prone to checking a string to see if it matches.
hard to extend the parent to be a bit wider than children,

gh: range queries to apply to root of feature hierarchies, and at least
one of the children to pass all range filters?
ad: why is this diff than requirement I gave?
gh: yours gives back partial feature groups. it's allowing filters to
apply to any of the children, not just the root.
ad: only difference is if you have two widely spaced features,
everything has an implicit convex hull. if your query hits the

gh: [whiteboard drawing]

      +-----------+        exon a in transcript c
      +----------------+   exon b in transcript c
           inside query

ee: for overlaps you would include the parent, for inside query you
would not.

ad: how will software guarantee this? min-max or just union of the
ee: min-max of all children.
ad: should be in the spec.
gh: allen: how do you do min and max of mRNA, implicit or explicit? for me,
it's explicit.
aday: explicit. 
ee: using gff1 where it's implicit, but our parsers force it to be
explicit in our data model.
aday: in gff3 it can be implicit (using '.').
gh: gff, bed, psl, xml formats, raw blast output -- all explicit.
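
A sketch of the software check ad asks about, for the simple case of
one location per feature on a single segment: derive the implicit
parent range as the min-max of the children, and accept an explicit
parent only if it contains that range. Function names and the
half-open (start, end) convention are assumptions.

```python
def min_max(child_locs):
    """Implicit parent range: smallest start to largest end of the children."""
    if not child_locs:
        raise ValueError("no child locations to derive a range from")
    return (min(s for s, _ in child_locs), max(e for _, e in child_locs))


def parent_is_valid(parent_loc, child_locs):
    """Parent may be wider than its children, but no child may stick out."""
    lo, hi = min_max(child_locs)
    return parent_loc[0] <= lo and hi <= parent_loc[1]


exons = [(100, 200), (400, 500), (800, 900)]
print(min_max(exons))                     # range a server could fill in
print(parent_is_valid((100, 900), exons))  # exact min-max
print(parent_is_valid((50, 1000), exons))  # wider parent, still allowed
print(parent_is_valid((150, 900), exons))  # first exon starts before parent
```

Checking against the min-max rather than requiring equality keeps the
wider-parent cases (for example the haplotype block) legal.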

ad: does server verify that it meets these criteria? each feature
coming in, if it has parent it can only have one segment id.
for each segment in the parent, find each one that matches the range
in the child, 

if any child has segment x,
only one location on segment x
aday: can have mult locs on the same segment.
ad: why not model as one range?
aday: need to create the parent in two locations.
gh: as long as one loc of parent contains the loc of the child, it's
ad: gregg saying that
aday: location only includes one instance of the children. two
locations for exon a, b, c. first set of locations for these exons is
different than the second set of locations for these exons. a logical
grouping not simple collection of all parts.

mult locations on the same segment is harder.
check location of parents, verify that no two children.

ad: spec now allows for dumb servers. adding these extra
requirements doesn't make the server easier, complicates life on
gh: it makes the client's life simpler.
aday: location as two additional attribs: group, rank
group - groups things together that are in the same segment
rank - prioritized location
conceptual grouping of things, to know which child locs match up with
which parent locations, because locations can overlap.

gh: (aside) can you make them multiple feats rather than diff locations?
it comes out as das2xml.

ad: need to mention to lincoln and berkeley folks. specify what the
algorithm is to

Topic: status reports

gh: doing writeback to allen's writeback server. create new annot, edit
location, add, remove, extend exons, can write them all back. keeps
creating new features in the db instead of editing the ones that are
there. plan: delete the old annot in the same doc that edits the new
aday: so you're leaving lots of old annots around.

aday: finishing touches. old uri - new uri mapping, so gregg
knows. fixing bugs on writeback server. working on new das front end
that takes incoming request, breaks it down with modulus operation with
configurable block size, filters the results, this is for
caching. working well. can convert the typical 40-50s response times
down to 7s on a single megabase region. takes a while to get cache
populated. todo: automatically populate cache. add code to know when a
block became stale, so server can flush cache to get new stuff.
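
The block-based caching idea can be sketched roughly as follows. This
is not Allen's implementation: the function names, the dictionary
cache and the fetch_block callback are hypothetical, and ranges are
assumed to be half-open.

```python
def block_keys(start, end, block_size):
    """Block-aligned (start, end) keys covering the half-open range."""
    first = start - (start % block_size)  # snap down to a block boundary
    return [(b, b + block_size) for b in range(first, end, block_size)]


def fetch_region(start, end, block_size, cache, fetch_block):
    """Serve a range request from cached blocks, filling misses on demand.

    `fetch_block(block_start, block_end)` is a stand-in for the slow
    backend call; it returns a dict of feature id -> (start, end).
    """
    results = {}
    for key in block_keys(start, end, block_size):
        if key not in cache:
            cache[key] = fetch_block(*key)
        for fid, (fs, fe) in cache[key].items():
            # blocks are wider than the request, so filter the results
            if fs < end and start < fe:
                results[fid] = (fs, fe)
    return results


print(block_keys(1_250_000, 2_250_000, 500_000))
```

Because every request is rounded out to the same block boundaries,
overlapping requests reuse the same cache entries; flushing a stale
block would amount to deleting its key from the cache.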

bo: refactor domain factor response. found lots of hardcoded
logic. went back to refactor. one object that populates hash structure
of objects, handles.
support for wiki stuff from lincoln, unique coord identifiers.
todo: go ahead and update test suite now out of date. coord filter
needs to be added in.

gh: server now supports full type uris and segment uris?
bo: yes, in cvs. todo: make rpm package and install on production
gh: then public release of igb can start using full type uri.
bo: can communicate with you on it.

gh: congrats -- end of code sprint. good to get the writeback stuff
going. spec changes are little, but feels very nailed down.

ad: finished off action items from yesterday. timestamp. reference
server implementation.

ee: still working on gff3 parser. in progress, nothing to report.

sc: updated affy probe set alignments for drosophila arrays to be
based on dm2 on our das/1 server (Ann's request). Restarted
server. Working on updating the affy das server info page (in progress).
todo: update the das2_server with latest improvements committed by
gregg, then test the new and improved bp2 format for exon data. will
need to deal with array prefix used by netaffx ('1:') rather than as
used in CHP files ('HuEx:').

Post-teleconference Discussion

gh: would you be willing to give up multiple locations in the spec?
aday: would you be willing to give up bidirectional parent-child pointers?
gh: let me think about it...
