[DAS2] DAS intro
Suzanna Lewis
suzi at fruitfly.org
Sat Nov 26 01:44:54 UTC 2005
Hi Andrew,
So there seem to be two questions. It would be good to have both in the
intro, but only as long as each description can be stated clearly in
just a sentence or two. If it takes more, then it clearly requires a
fuller description outside of the intro.
I'll try to give my understanding (but goodness knows I am peering
through different lenses). I don't think in terms of the spec at all,
just the information that needs to be conveyed.
#1 "reference frame" =========================================
"Reference frame" is (to my mind) "reference sequence". At least, that
is what I've always called it.
First, accuracy has nothing at all to do with it, so we don't want the
sentence in there.
Second, the region of sequence that is returned is nothing more than
that. Think of it as a special type of feature. This is what makes a
transformation possible from one coordinate-system to another (by
adding the correct offsets).
Third, just think of "reference sequence" as a coordinate system. One
can have the exact same feature and indicate that: on
coordinate-system-A this feature starts and ends here, and on
coordinate-system-B it starts and ends there. Thus a feature's
coordinates may be given both on a chromosome, and on a contig, and on
any other coordinate-system that can be derived through a transform
from these. So you could change the sentence below to read "A reference
server may supply features where the locations (start and end) are
relative to either contigs, some other arbitrary region, or to the
entire chromosome."
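
The offset idea above can be sketched in a few lines of Python. This is
only an illustration, not anything taken from the DAS/2 spec: the names
Region, Location, to_parent and to_region are hypothetical, as are the
ids "contigA" and "chr1", and the sketch assumes 0-based coordinates
with the region on the same strand as its parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A stretch of a parent sequence exposed as its own coordinate system."""
    name: str      # hypothetical id, e.g. a contig
    parent: str    # hypothetical id, e.g. a chromosome
    offset: int    # 0-based start of the region on the parent (assumption)
    length: int    # length of the region


@dataclass(frozen=True)
class Location:
    """Start and end of a feature on one named coordinate system."""
    seq: str
    start: int
    end: int


def to_parent(loc: Location, region: Region) -> Location:
    """Re-express a region-relative location on the parent by adding the offset."""
    if loc.seq != region.name:
        raise ValueError(f"location is on {loc.seq}, not {region.name}")
    return Location(region.parent, loc.start + region.offset, loc.end + region.offset)


def to_region(loc: Location, region: Region) -> Location:
    """Inverse transform: subtract the offset if the feature lies inside the region."""
    if loc.seq != region.parent:
        raise ValueError(f"location is on {loc.seq}, not {region.parent}")
    start, end = loc.start - region.offset, loc.end - region.offset
    if start < 0 or end > region.length:
        raise ValueError("feature is not fully contained in the region")
    return Location(region.name, start, end)


# The same feature, stated once on the contig and once on the chromosome.
contig = Region(name="contigA", parent="chr1", offset=1000, length=500)
on_contig = Location("contigA", 100, 200)
on_chromosome = to_parent(on_contig, contig)
print(on_contig, on_chromosome)
```

Treating the region as just another feature with a known placement is
what lets the same annotation carry a location on each coordinate
system.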
#2 "multiple parents" =========================================
It still is easier for me to think of this in terms of sequences. We
may know that somewhere out in the world a sequence must exist, but the
data/sequence we have collected is fragmentary. For example, thinly
sequenced genomes (resulting in many separate contigs) or a pair of
ESTs from a cDNA. In either of these cases we need to be able to have
the many to many relationships you talk about. This one is perhaps too
subtle for the introduction, but if we decide to include it then I
think it should first be phrased in terms of the problem (biological
sampling) and then in terms of the solution (multiple parents).
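
As a rough sketch of what a many-to-many parent/part relationship looks
like as a data structure, here is a small Python example. It is not the
DAS/2 data model itself: the FeatureGraph class and the ids "parent1",
"part1" and so on are hypothetical, chosen to mirror the "two parents to
three parts" case from the quoted message below.

```python
from collections import defaultdict


class FeatureGraph:
    """Hypothetical many-to-many index between parent features and their parts."""

    def __init__(self):
        self._parts = defaultdict(set)    # parent id -> ids of its parts
        self._parents = defaultdict(set)  # part id -> ids of its parents

    def link(self, parent: str, part: str) -> None:
        """Record that `part` is one of the parts of `parent`."""
        self._parts[parent].add(part)
        self._parents[part].add(parent)

    def parts_of(self, parent: str) -> list:
        """Sorted ids of the parts of `parent` (empty if none are known)."""
        return sorted(self._parts.get(parent, ()))

    def parents_of(self, part: str) -> list:
        """Sorted ids of the parents of `part` (empty if none are known)."""
        return sorted(self._parents.get(part, ()))


# Two parents sharing three parts; "part2" belongs to both parents.
graph = FeatureGraph()
graph.link("parent1", "part1")
graph.link("parent1", "part2")
graph.link("parent2", "part2")
graph.link("parent2", "part3")

print(graph.parts_of("parent1"))
print(graph.parents_of("part2"))
```

Keeping both directions indexed means neither side has to be a strict
tree: a part can be looked up from any of its parents and vice versa.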
-S
On Nov 25, 2005, at 4:35 PM, Andrew Dalke wrote:
> Hi Suzi,
>
> You're supposed to be on holiday - it's Thanksgiving after all.
>
> Though I'm not celebrating it until next week. I wonder where
> I can find pumpkin pie mix here ...
>
>>> DAS/2 describes a data model for genome annotations
>> , THAT IS, DESCRIPTIONS OF FEATURES LOCATED ON THE GENOMIC SEQUENCE
>
> Changed, along with the other fixes.
>
>> (DELETED LAST 2 SENTENCES).
>
> That was the two lines about
>
>>> Portions of
>>> the assembly may have higher relative accuracy than the assembly as a
>>> whole. A reference server may supply these portions as an alternate
>>> reference frame.
>
> In the intro I want to mention all of the parts of DAS. The
> problem is that I still don't understand the /region request.
> These two lines were my best attempt at explaining them.
>
> Was the deletion because my understanding is wrong or because it's
> not needed for the intro?
>
> I think my confusion is related to the concept you mention in:
>>> Annotations are located on the genome with a start and end position.
>>> The range may be specified multiple times if there are alternate
>>>
>> SEQUENCES THEY MAY BE PLACED UPON (REFERENCE FRAMES).
>
> because I don't understand what I should change. I made up the
> term 'reference frame' because of my physics training. Is it
> the correct term here? Does 'reference frame' as it's normally
> used only refer to the full assembly or does it refer to each
> "/region" as well? If I give the coordinates on a contig can
> I say it's in the reference frame of that contig?
>
> (Hmm, David Block agrees with me, according to
> http://open-bio.org/bosc2001/abstracts/lightning/block
> The presence of a Tiling_Path table allows the loading of
> any arbitrary length of sequence, in the reference frame
> of any of the contigs that make up the tiling path. )
>
>
>
> I thought it was important to mention that a given annotation
> may have "several <LOC> tags if the feature's location can be
> represented in multiple coordinate systems (e.g. multiple builds
> of a genome or multiple contigs)"
>
> Then again, I don't understand how a given feature can be
> annotated on multiple builds because I thought that a feature
> was only associated with a single versioned source, and a
> versioned source has only one build.
>
>
> I would like to have something in the intro which mentions
> "/region". I just don't know how to do it. Why does anyone
> care about regions and not just point directly to the sequence?
>
>>> An annotation may contain multiple non-contiguous
>>> parts
>>
>> (DELETED PHRASE AND SENTENCE)
>
> The deleted text there was ", making it the parent of those parts.
> Some parts may have more than one parent."
>
> I put it there because I remember we talked a lot about this
> at CSHL a couple years back and wanted to make sure the data
> model handled cases where, say, there were two parents to three
> parts. It seems to me that this structure is important enough
> that someone who is trying to get a quick understanding of
> DAS annotations would be interested in it.
>
> My internal model for the expected reader is someone like
> Allen or Gregg - people who have some experience in data
> models for annotations and would like to know that DAS
> can handle those sorts of more complicated tree structures.
>
> I'm willing to move it further into the text, but I'm not
> convinced that it makes things less confusing or simpler.
> Features having parts and parents is an essential part of
> the DAS data model.
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2