[DAS2] DAS intro
Lincoln Stein
lstein at cshl.edu
Mon Nov 28 17:24:32 UTC 2005
> <LOC pos="region/Chr3/1271:1507:1" seq="sequence/Chr3/1271:1507:1"/>
>
>
> <region> is only a link to the sequence and a length, as in:
>
> <REGION id="../sequence/ctg2/100:200" length="100" name="ABCDE" />
You know, this is still kind of ugly. I hate to revisit this so late in the
game, but can't we make sequence retrieval a three-step process?
1) Feature request returns:
<LOC pos="region/Chr3/1271:1507:1" />
2) Region request returns:
<REGION id="Chr3/1271:1507:1" seq="../sequence/Chr3/1271:1507:1" />
(where seq= could be an absolute URL if someone else owns the bases)
3) Sequence request then returns the bases
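
A rough sketch of how a client might chain the three steps, using only
the two elements shown above. The base URL, the helper names
region_url() and sequence_url(), and the choice of which URL each
relative reference is resolved against are assumptions made for
illustration; nothing here is taken from the DAS2 spec.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical server layout (assumption, not from the DAS2 spec).
BASE = "http://example.org/das2/genome/"
REGION_BASE = BASE + "region/"


def region_url(loc_xml, base=BASE):
    """Step 1: the feature's <LOC> element names a region."""
    return urljoin(base, ET.fromstring(loc_xml).get("pos"))


def sequence_url(region_xml, base=REGION_BASE):
    """Step 2: the <REGION> element points at the sequence.

    seq= may be a relative reference or an absolute URL
    when another server owns the bases.
    """
    return urljoin(base, ET.fromstring(region_xml).get("seq"))


loc = '<LOC pos="region/Chr3/1271:1507:1" />'
region = '<REGION id="Chr3/1271:1507:1" seq="../sequence/Chr3/1271:1507:1" />'
# Made-up foreign host, to show the absolute-URL case.
foreign = ('<REGION id="Chr3/1271:1507:1" '
           'seq="http://other.example.org/sequence/Chr3/1271:1507:1" />')

print(region_url(loc))
print(sequence_url(region))
print(sequence_url(foreign))
# Step 3 would be a plain HTTP GET on the URL returned by sequence_url().
```

Because seq= is resolved like any other URL reference, the client does
not need a special case for bases that live on someone else's server.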
Lincoln
>
>
> One alternate possibility is to change that so "pos" points to a
> /feature (instead of a /region) and have features for each contig or
> other assembly component. The result would look like:
>
> <LOC pos="feature/AB1234/671:907:1"
> seq="sequence/Chr3/1271:1507:1"/>
>
> <FEATURE id="feature/AB1234" type="ABCDE_type"> ...
>
> Doing this, however, means that all features must support subranges.
>
>
> As an alternate solution without ranges, use
>
> <LOC pos="feature/AB1234" seq="sequence/Chr3/1271:1507:1"/>
>
> and then look up the sequence coordinates of feature/AB1234 to
> figure out where it starts/stops.
>
>
> The other advantage to a region is you can ask for the assembly
> via the 'agp' format. But because of the existing support for
> formats which are only valid for some features, you can do that by
> asking for, say, all assembly_component features (via the feature
> filter) and returning the results in 'agp' format.
>
> > Third, just think of "reference sequence" as a coordinate system. One
> > can have the exact same feature and indicate that: on
> > coordinate-system-A this feature starts and ends here, and on
> > coordinate-system-B it starts and ends there. Thus a feature's
> > coordinates may be given both on a chromosome, and on a contig, and on
> > any other coordinate-system that can be derived through a transform
> > from these.
>
> I believe I understand this. There really is only one reference frame
> for the entire genome sequence, for a given assembly, and all other
> coordinate systems are a fixed and definite offset of that single
> reference frame. I believe this is called the golden path?
>
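
The "fixed and definite offset" idea can be sketched with the numbers
from the LOC example earlier in this message, where AB1234/671:907
lines up with Chr3/1271:1507. The offset of 600 is inferred from those
two ranges; it is not stated anywhere in the thread. The function name
and the forward-strand-only handling are likewise assumptions.

```python
# Offset of contig AB1234 on Chr3, inferred from the example ranges
# (1271 - 671 = 600). Assumption: both ranges are on the same strand.
AB1234_ON_CHR3 = 600


def contig_to_chromosome(start, end, offset):
    """Shift a contig-relative range into chromosome coordinates."""
    return start + offset, end + offset


def chromosome_to_contig(start, end, offset):
    """Inverse transform: chromosome coordinates back onto the contig."""
    return start - offset, end - offset


print(contig_to_chromosome(671, 907, AB1234_ON_CHR3))
print(chromosome_to_contig(1271, 1507, AB1234_ON_CHR3))
```

If the golden path changes, only the offset has to be updated; the
contig-relative coordinates of the feature stay the same.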
> My reference to accuracy is because I figured that given two features
> A and B on an assembly component X then the fuzziness in the relative
> distance between A and B is small if X is also small. That is, smaller
> terms are less likely to have changes as the golden path changes.
>
> > So you could change the sentence below to read "A reference server
> > may supply features where the locations (start and end) are relative
> > to either contigs, some other arbitrary region, or to the entire
> > chromosome."
>
> Why not always supply it relative to the chromosome coordinates? The
> spec now allows that as an optional field. I can't figure out why you
> would want to do otherwise.
>
> Is it because sometimes it's easier to work with, say, a large number of
> contig reference frames than with one large reference frame? Does that
> mean we shift the complexity of coordinate translation from the data
> provider to the data consumer? (Making it easier to generate data than
> to consume data.)
>
> > This one is perhaps too subtle for the introduction, but if we decide
> > to include it then I think it should first be phrased in terms of the
> > problem (biological sampling) and then in terms of the solution
> > (multiple parents).
>
> Oh, definitely. It's some place where I just don't have the domain
> knowledge to explain it or even come up with examples.
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu