[DAS2] DAS intro

Tue Nov 29 19:30:41 UTC 2005

Andrew Dalke wrote:
> Ed Erwin:
> 
>> No.  The coordinate transformations are often more complicated than 
>> simple offsets.  The coordinate space for features on one contig can 
>> be 'backwards' with respect to a different contig, and the coordinate 
>> space for a gene may skip over one or more gaps with respect to the 
>> genomic sequence.
> 
> 
> The /region entities in the DAS/2 spec are defined as
> 
> <REGION> (zero or more)
> A top-level region on the genome (similar to the "entry points" of
> the DAS/1 protocol).
>     id – the URI of the sequence ID
>     length – length of the sequence
>     name (optional) – a human-readable label for use when referring
>        to the region
>     doc_href (optional) – a URL that gives additional information
>        about this region
> 
> Here is an example
> 
>    <REGION id="../sequence/ctg2" length="81918" name="VolvoxContig2" />
> 

I had to go back and look up the context for this discussion.  Here it is:

>> [Suzi wrote]
>> Third, just think of "reference sequence" as a coordinate system. One 
>> can have the exact same feature and indicate that: on 
>> coordinate-system-A this feature starts and ends here, and on 
>> coordinate-system-B it starts and ends there. Thus a feature's 
>> coordinates may be given both on a chromosome, and on a contig, and on 
>> any other coordinate-system that can be derived through a transform 
>> from these.
> 
> [Andrew wrote]
> I believe I understand this.  There really is only one reference frame 
> for the entire genome sequence, for a given assembly, and all other 
> coordinate systems are a fixed and definite offset of that single 
> reference frame.

I understand this as talking about coordinates in general, not the 
<region> elements or "pos" attributes in the spec.  Suzi specifically 
mentions chromosomes and contigs; one can definitely be backwards with 
respect to the other. But top-level regions in an assembly would 
probably all be chromosomes or all be contigs, rather than a mixture.

There is not one single "reference frame" for an assembly: rather there 
is one coordinate axis for *each* top-level region.  If those top-level 
regions are chromosomes, then there is no relationship between the 
coordinates on different ones.  If those top-level regions are contigs 
or ESTs (which I believe is allowed by the spec), then positions on one 
of them can be related to positions on others through various transforms.
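
To make "backwards" concrete, here is a rough sketch in Python.  None of 
this comes from the DAS/2 spec: the Placement class, its field names, and 
the 0-based positions are assumptions I made up for illustration.  It 
only shows why a single fixed offset is not enough once one axis is 
reversed with respect to another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical placement of a contig on a parent axis (not DAS/2)."""
    parent_start: int  # 0-based position of the contig's first base on the parent
    length: int        # contig length in bases
    strand: int        # +1 = same direction as the parent, -1 = reversed

    def to_parent(self, pos: int) -> int:
        """Map a 0-based contig position to a 0-based parent position."""
        if not 0 <= pos < self.length:
            raise ValueError("position outside contig")
        if self.strand == 1:
            # Same orientation: a plain offset is enough.
            return self.parent_start + pos
        # Reversed orientation: positions count down from the far end.
        return self.parent_start + self.length - 1 - pos


forward = Placement(parent_start=1000, length=500, strand=1)
reverse = Placement(parent_start=2000, length=500, strand=-1)

print(forward.to_parent(10))  # 1010
print(reverse.to_parent(10))  # 2489
```

For the forward contig the mapping is an offset; for the reversed one it 
is a reflection plus an offset, so increasing contig positions map to 
decreasing parent positions.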

> This is a very simple definition.  As far as I can tell it does not
> capture the information for, say, skipping.
>
> How would you represent "the coordinate space for a gene [that skips]
> over one or more gaps with respect to the genomic sequence" using the
> current DAS/2 object model?
>
> Or goes backwards?  I don't see anything like that.

You represent gaps with parent-child relationships between <FEATURE> 
elements, and going backwards by specifying the "+1" strand on one contig 
and the "-1" strand on the other.
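
As a rough illustration of the gap case (again Python, and again not 
from the spec), suppose a parent feature has child blocks on the genomic 
axis with gaps between them.  The function name, the (start, end) tuples, 
and the 0-based half-open convention are all assumptions of this sketch. 
A position in the gene's own gapless coordinate space is mapped to the 
genomic axis by walking the child blocks in order.

```python
from typing import List, Tuple


def gene_to_genomic(pos: int, blocks: List[Tuple[int, int]]) -> int:
    """Map a 0-based position in gapless gene space to the genomic axis.

    `blocks` are the child features as (start, end) pairs on the genomic
    axis, 0-based and half-open, sorted and non-overlapping.  This layout
    is hypothetical and chosen only for illustration.
    """
    if pos < 0:
        raise ValueError("negative position")
    consumed = 0  # gene-space bases covered by the blocks seen so far
    for start, end in blocks:
        size = end - start
        if pos < consumed + size:
            # The position falls inside this block.
            return start + (pos - consumed)
        consumed += size
    raise ValueError("position beyond the end of the gene")


# Two child blocks separated by a 150-base gap on the genomic axis.
blocks = [(100, 150), (300, 340)]

print(gene_to_genomic(49, blocks))  # 149, last base of the first block
print(gene_to_genomic(50, blocks))  # 300, first base after the gap
```

Adjacent gene positions 49 and 50 land 151 bases apart on the genomic 
axis, which no single offset can express.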

The spec does not require a DAS/2 server to know how to perform 
transformations from one coordinate system to another, but your 
statement "there really is only one reference frame for the entire 
genome sequence" is wrong as I understand it.  There is one coordinate 
axis for *each* top-level region.