[DAS2] DAS intro

Andrew Dalke dalke at dalkescientific.com
Sun Nov 27 01:20:24 UTC 2005


Suzi:
> so there seem to be 2 questions. it would be good to have both in the 
> intro, but only as long as the description can be clearly stated in 
> just a sentence or two. If it takes more then it is clearly something 
> that requires a fuller description outside of the intro.

Agreed.

> I'll try to give my understanding (but goodness knows I am peering 
> through different lenses). I don't think in terms of the spec at all, 
> just the information that needs to be conveyed.
>
> #1 "reference frame" =========================================
>
> "reference frame", is (to my mind) "reference sequence". at least, 
> that is what i've always called it.


> First, accuracy has nothing at all to do with it, so we don't want the 
> sentence in there.

I'm fine with that.  I've found it better to declare my ignorance early
than to keep it hidden.

> Second, the region of sequence that is returned is nothing more than 
> that. Think of it as a special type of feature. This is what makes a 
> transformation possible from one coordinate-system to another (by 
> adding the correct offsets)

I can think of it as a feature just fine.  But then shouldn't each
region also be a feature?  Why wouldn't all contigs be visible as an
annotation?

Contigs are in SOFA as

     @is_a@ contig ; SO:0000149 @is_a@ assembly_component ;
         SO:0000143 @part_of@ supercontig ; SO:0000148

What advantage is there in breaking this feature out as a "/region"?

One that I can see is that the reference server provides the regions
while the annotation server provides the other features.  But if
that's the case we could have the reference server also provide the
regions as features, and the annotation server makes references to
those features rather than to regions.

That is, in the current scheme we have:

<feature> has 0 or more <loc> elements, where the 'pos' attribute
    links to region + start/stop range and the optional 'seq' attribute
    links to the sequence range, as in:

    <LOC  pos="region/Chr3/1271:1507:1" seq="sequence/Chr3/1271:1507:1"/>


<region> is only a link to the sequence and a length, as in:

    <REGION id="../sequence/ctg2/100:200" length="100" name="ABCDE" />
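
To make the shape of those 'pos', 'seq' and 'id' values concrete, here
is a minimal Python sketch of how a client might take them apart.  The
names LocRef and parse_loc_ref are made up for illustration and are not
part of the spec.  I am also assuming that the third number in
"1271:1507:1" is a strand and that leading "../" segments can simply be
dropped.

```python
from typing import NamedTuple, Optional


class LocRef(NamedTuple):
    """One parsed reference (hypothetical helper type, not from the spec)."""
    kind: str                     # e.g. "region", "feature" or "sequence"
    ident: str                    # e.g. "Chr3" or "ctg2"
    start: Optional[int] = None
    stop: Optional[int] = None
    strand: Optional[int] = None  # assumption: third range field is the strand


def parse_loc_ref(value: str) -> LocRef:
    """Split 'kind/ident[/start:stop[:strand]]' into its parts."""
    # Drop empty and relative path segments such as "..".
    parts = [p for p in value.split("/") if p not in ("", ".", "..")]
    if len(parts) == 2:
        return LocRef(parts[0], parts[1])
    if len(parts) != 3:
        raise ValueError(f"unrecognized reference: {value!r}")
    kind, ident, span = parts
    fields = span.split(":")
    if len(fields) not in (2, 3):
        raise ValueError(f"unrecognized range: {span!r}")
    start, stop = int(fields[0]), int(fields[1])
    strand = int(fields[2]) if len(fields) == 3 else None
    return LocRef(kind, ident, start, stop, strand)


print(parse_loc_ref("region/Chr3/1271:1507:1"))
print(parse_loc_ref("../sequence/ctg2/100:200"))
```

The parser does not care whether the first segment is "region" or
"feature", so the same code would handle either kind of reference.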


One alternate possibility is to change that so "pos" points to a
/feature (instead of a /region) and have features for each contig or
other assembly component.  The result would look like:

    <LOC  pos="feature/AB1234/671:907:1" seq="sequence/Chr3/1271:1507:1"/>

   <FEATURE id="feature/AB1234" type="ABCDE_type"> ...

Doing this, however, means that all features must support subranges.
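
As a rough sketch of what supporting a subrange involves: a position on
feature/AB1234 has to be shifted by wherever AB1234 starts on Chr3.  The
start of 601 below is my own inference from the example above (671 and
907 on the feature line up with 1271 and 1507 on Chr3); it is not stated
anywhere.  I also assume 1-based inclusive coordinates on the forward
strand, and the function name is hypothetical.

```python
def to_parent_coords(start: int, stop: int, parent_start: int) -> tuple:
    """Map a 1-based inclusive range on a feature onto its parent sequence.

    parent_start is the parent coordinate of the feature's first base.
    Forward strand only; this is an illustration, not the spec's algorithm.
    """
    offset = parent_start - 1
    return start + offset, stop + offset


# Assumed value, inferred from the LOC example (671 -> 1271, 907 -> 1507).
AB1234_START_ON_CHR3 = 601

print(to_parent_coords(671, 907, AB1234_START_ON_CHR3))
```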


As an alternate solution without ranges, use

    <LOC  pos="feature/AB1234" seq="sequence/Chr3/1271:1507:1"/>

and then look up the sequence coordinates of feature/AB1234 to
figure out where it starts/stops.


The other advantage to a region is you can ask for the assembly
via the 'agp' format.  But because of the existing support for
formats which are only valid for some features you can do that by
asking for, say, all assembly_component features (via the feature
filter) and getting the results back in 'agp' format.

> Third, just think of "reference sequence" as a coordinate system. One 
> can have the exact same feature and indicate that: on 
> coordinate-system-A this feature starts and ends here, and on 
> coordinate-system-B it starts and ends there. Thus a feature's 
> coordinates may be given both on a chromosome, and on a contig, and on 
> any other coordinate-system that can be derived through a transform 
> from these.

I believe I understand this.  There really is only one reference frame
for the entire genome sequence, for a given assembly, and all other
coordinate systems are a fixed and definite offset of that single
reference frame.  I believe this is called the golden path?
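
If that understanding is right, converting between any two coordinate
systems amounts to going through the one reference frame.  Here is a
minimal Python sketch under that assumption.  The Frame class, the
contig names and the offsets are all invented for illustration, and it
only handles the forward strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """A coordinate system defined by a fixed offset from the reference."""
    name: str
    offset: int  # added to a local position to get the reference position


def to_reference(frame: Frame, pos: int) -> int:
    return pos + frame.offset


def from_reference(frame: Frame, pos: int) -> int:
    return pos - frame.offset


def convert(src: Frame, dst: Frame, pos: int) -> int:
    """Convert pos from src to dst by way of the single reference frame."""
    return from_reference(dst, to_reference(src, pos))


# Hypothetical frames; the offsets are made up for this example.
chromosome = Frame("Chr3", 0)
contig_a = Frame("contigA", 600)
contig_b = Frame("contigB", 1000)

print(convert(contig_a, chromosome, 671))
print(convert(contig_a, contig_b, 671))
```

Note that this does not check whether the position actually falls
inside the destination contig; a real client would need each contig's
length for that.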

My reference to accuracy is because I figured that given two features
A and B on an assembly component X then the fuzziness in the relative
distance between A and B is small if X is also small.  That is, smaller
terms are less likely to have changes as the golden path changes.


>  So you could change the sentence below to read "A reference server 
> may supply features where the locations (start and end) are relative 
> to either contigs, some other arbitrary region, or to the entire 
> chromosome."

Why not always supply it relative to the chromosome coordinates?  The
spec now allows that as an optional field.  I can't figure out why you
would want to do otherwise.

Is it because sometimes it's easier to work with, say, a large number of
contig reference frames than with one large reference frame?  Does that
mean we shift the complexity of coordinate translation from the data
provider to the data consumer?  (Making it easier to generate data than
to consume data.)


> This one is perhaps too subtle for the introduction, but if we decide 
> to include it then I think it should first be phrased in terms of the 
> problem (biological sampling) and then in terms of the solution 
> (multiple parents).

Oh, definitely.  It's some place where I just don't have the domain
knowledge to explain it or even come up with examples.

					Andrew
					dalke at dalkescientific.com



