[DAS2] DAS intro
Suzanna Lewis
suzi at fruitfly.org
Sun Nov 27 01:24:07 UTC 2005
Let's add this to the agenda for Monday morning. Hopefully that will be
faster than via e-mail.
On Nov 26, 2005, at 5:20 PM, Andrew Dalke wrote:
> Suzi:
>> so there seem to be 2 questions. it would be good to have both in the
>> intro, but only as long as the description can be clearly stated in
>> just a sentence or two. If it takes more then it is clearly something
>> that requires a fuller description outside of the intro.
>
> Agreed.
>
>> I'll try to give my understanding (but goodness knows I am peering
>> through different lenses). I don't think in terms of the spec at all,
>> just the information that needs to be conveyed.
>>
>> #1 "reference frame" =========================================
>>
>> "reference frame", is (to my mind) "reference sequence". at least,
>> that is what i've always called it.
>
>
>> First, accuracy has nothing at all to do with it, so we don't want
>> the sentence in there.
>
> I'm fine with that. I've found it better to declare my ignorance early
> than to keep it hidden.
>
>> Second, the region of sequence that is returned is nothing more than
>> that. Think of it as a special type of feature. This is what makes a
>> transformation possible from one coordinate-system to another (by
>> adding the correct offsets)
>
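A minimal sketch of the offset idea in the paragraph above, not taken from the DAS2 spec: if a region is placed on a parent sequence at a known start, a region-relative range maps to the parent by adding that start, and back again by subtracting it. The `Placement` type, the function names, and the numbers are hypothetical, and reverse-strand placements are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical record: where a region sits on its parent sequence."""
    parent: str        # e.g. "Chr3"
    parent_start: int  # offset of the region's origin on the parent (assumed)


def to_parent(placement, start, end):
    """Map a region-relative range onto the parent by adding the offset."""
    return (placement.parent,
            placement.parent_start + start,
            placement.parent_start + end)


def to_region(placement, start, end):
    """Inverse transform: subtract the same offset."""
    return (start - placement.parent_start, end - placement.parent_start)


# Made-up placement: contig "ctg2" begins 1000 bases into Chr3.
ctg2 = Placement(parent="Chr3", parent_start=1000)
on_chr = to_parent(ctg2, 271, 507)
print(on_chr)
print(to_region(ctg2, on_chr[1], on_chr[2]))
```

The same pair of functions works for any coordinate system that differs from another by a fixed offset, which is the only case discussed here.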
> I can think of it as a feature just fine. But then shouldn't each
> region also be a feature? Why wouldn't all contigs be visible as an
> annotation?
>
> Contigs are in SOFA as
>
> @is_a@ contig ; SO:0000149 @is_a@ assembly_component ;
> SO:0000143 @part_of@ supercontig ; SO:0000148
>
> What advantage is there to breaking this feature out as a "/region"?
>
> One that I can see is that the reference server provides the regions
> while the annotation server provides the other features. But if
> that's the case we could have the reference server also provide the
> regions as features, and the annotation server makes references to
> those features rather than to regions.
>
> That is, in the current scheme we have:
>
> <feature> has 0 or more <loc> elements, where the 'pos' attribute
> links to region + start/stop range and the optional 'seq' attribute
> links to the sequence range, as in:
>
> <LOC pos="region/Chr3/1271:1507:1"
> seq="sequence/Chr3/1271:1507:1"/>
>
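As a rough illustration of how a client might take the LOC example above apart, here is a sketch that splits each attribute into a path and a numeric range. Reading the three colon-separated numbers as start, stop, and strand is an interpretation of the example rather than something this thread states, and `split_range` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def split_range(uri):
    """Split 'path/a:b:c' into (path, a, b, c).

    Assumes the last path segment holds three colon-separated integers;
    treating them as start, stop, and strand is an assumption.
    """
    path, _, span = uri.rpartition("/")
    start, stop, strand = (int(part) for part in span.split(":"))
    return path, start, stop, strand


# The LOC element exactly as written in the example above.
loc = ET.fromstring(
    '<LOC pos="region/Chr3/1271:1507:1" seq="sequence/Chr3/1271:1507:1"/>'
)
pos = split_range(loc.get("pos"))
seq = split_range(loc.get("seq"))
print(pos)
print(seq)
```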
>
> <region> is only a link to the sequence and a length, as in:
>
> <REGION id="../sequence/ctg2/100:200" length="100" name="ABCDE" />
>
>
> One alternate possibility is to change that so "pos" points to a
> /feature (instead of a /region) and have features for each contig or
> other assembly component. The result would look like:
>
> <LOC pos="feature/AB1234/671:907:1"
> seq="sequence/Chr3/1271:1507:1"/>
>
> <FEATURE id="feature/AB1234" type="ABCDE_type"> ...
>
> Doing this, however, means that all features must support subranges.
>
>
> As an alternate solution without ranges, use
>
> <LOC pos="feature/AB1234" seq="sequence/Chr3/1271:1507:1"/>
>
> and then look up the sequence coordinates of feature/AB1234 to
> figure out where it starts/stops.
>
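The lookup described above could be sketched as follows. The table contents are invented: an offset of 600 for feature/AB1234 on Chr3 is chosen only because it makes the result agree with the 671:907 subrange in the earlier example. A real client would presumably fetch the parent feature's coordinates from the reference server; `PARENT_PLACEMENT` and `relative_to_parent` are hypothetical names.

```python
# Hypothetical table: parent feature id -> (sequence id, offset on that sequence).
# The offset 600 is an assumption picked to match the earlier example.
PARENT_PLACEMENT = {
    "feature/AB1234": ("sequence/Chr3", 600),
}


def relative_to_parent(pos, seq, table=PARENT_PLACEMENT):
    """Given a range-less 'pos' and a ranged 'seq', return the range
    expressed relative to the parent feature named by 'pos'."""
    seq_path, _, span = seq.rpartition("/")
    start, stop, strand = (int(part) for part in span.split(":"))
    parent_seq, offset = table[pos]  # the extra lookup this scheme requires
    if parent_seq != seq_path:
        raise ValueError(f"{pos} is not placed on {seq_path}")
    return start - offset, stop - offset, strand


print(relative_to_parent("feature/AB1234", "sequence/Chr3/1271:1507:1"))
```

The cost of dropping ranges from "pos" is visible in the table lookup: the consumer needs the parent feature's coordinates before it can say where the annotation sits on that parent.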
>
> The other advantage to a region is you can ask for the assembly
> via the 'agp' format. But because of the existing support for
> formats which are only valid for some features, you can do that by
> asking for, say, all assembly_component features (via the feature
> filter) and requesting the results in 'agp' format.
>
>> Third, just think of "reference sequence" as a coordinate system. One
>> can have the exact same feature and indicate that: on
>> coordinate-system-A this feature starts and ends here, and on
>> coordinate-system-B it starts and ends there. Thus a feature's
>> coordinates may be given both on a chromosome, and on a contig, and
>> on any other coordinate-system that can be derived through a
>> transform from these.
>
> I believe I understand this. There really is only one reference frame
> for the entire genome sequence, for a given assembly, and all other
> coordinate systems are a fixed and definite offset of that single
> reference frame. I believe this is called the golden path?
>
> My reference to accuracy is because I figured that given two features
> A and B on an assembly component X then the fuzziness in the relative
> distance between A and B is small if X is also small. That is, smaller
> terms are less likely to have changes as the golden path changes.
>
>
>> So you could change the sentence below to read "A reference server
>> may supply features where the locations (start and end) are relative
>> to either contigs, some other arbitrary region, or to the entire
>> chromosome."
>
> Why not always supply it relative to the chromosome coordinates? The
> spec now allows that as an optional field. I can't figure out why you
> would want to do otherwise.
>
> Is it because sometimes it's easier to work with, say, a large number
> of contig reference frames than with one large reference frame? Does
> that mean we shift the complexity of coordinate translation from the
> data provider to the data consumer? (Making it easier to generate data
> than to consume data.)
>
>
>> This one is perhaps too subtle for the introduction, but if we decide
>> to include it then I think it should first be phrased in terms of the
>> problem (biological sampling) and then in terms of the solution
>> (multiple parents).
>
> Oh, definitely. It's some place where I just don't have the domain
> knowledge to explain it or even come up with examples.
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2