What are regions for? (was Re: [DAS2] DAS intro)

Wed Nov 30 00:02:07 UTC 2005

Ed:
> I understand this as talking about coordinates in general, not the 
> <region> elements or "pos" attributes in the spec.  Suzi specifically 
> mentions chromosomes and contigs; one can definitely be backwards with 
> respect to the other. But top-level regions in an assembly would 
> probably all be chromosomes or all be contigs, rather than a mixture.

I'm trying to figure out when people would actually use /region.

In my way of understanding things there is the genomic sequence.
That consists of a set of chromosomes, each with a list of bases.

A chromosome is assembled from parts.  One of these parts is
called a 'contig'.  I thought I knew what it was, but according to
   http://staden.sourceforge.net/contig.html
there are several meanings.

What I understand is that a 'contig' is a sequenced chunk of
DNA which has overlaps with other contigs and when combined
can be used to deduce the entire sequence (excepting regions
of repeats and other ambiguities).  The best such deduction
is the golden path.
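
To check my own understanding of "overlaps ... when combined", here
is a minimal sketch in Python.  It is a toy, not how a real assembler
works: the function name merge_overlap, the min_overlap cutoff and the
example sequences are all made up, and it only handles exact overlaps
between two pieces on the same strand.

```python
def merge_overlap(left, right, min_overlap=3):
    """Join two pieces where a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None if no exact overlap of at
    least `min_overlap` bases exists.  The longest overlap wins.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            # Keep `left` whole and append only the new part of `right`.
            return left + right[size:]
    return None


merged = merge_overlap("ACGTTGCA", "TGCAGGT")
```

Here the two pieces share "TGCA", so merged is "ACGTTGCAGGT".  Repeats
are exactly where this naive approach breaks down, since several
overlaps can look equally good.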

For DAS/2 we assume sequenced genomes.  When will people
use top-level regions which are not chromosomes?

Chromosome top-level regions are identical to the /sequence,
except for the ability to get the assembly and the sequence
data directly.  Is that correct?

The spec allows links from a feature into several different
regions.  This suggests to me that sometimes there will be
regions which are a mixture of contigs and chromosomes.
Else why support that ability?

There is nothing in the spec (that I know of) which allows any
hierarchy to the regions - all regions are top-level.  Is
this correct?

> If those top-level regions are chromosomes, then there is no 
> relationship between the coordinates on different ones.

While I understand that, I did get it wrong when I wrote it down.

In my head I was thinking "each base has a 1-to-1 mapping to a
number, and if two bases are next to each other then the corresponding
two numbers are next to each other."  This is invalid because the
converse is not true - if one number is the end of a chromosome and
the other is the start of the next then the two bases are not next
to each other.
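
Here is that mistake as a small Python sketch.  The chromosome names
and lengths are invented, and the single genome-wide numbering is only
the (wrong) picture I had in my head, not anything from the DAS/2 spec.

```python
from bisect import bisect_right

# Hypothetical chromosome lengths, purely for illustration.
CHROMS = [("Chr1", 10), ("Chr2", 7), ("Chr3", 5)]

# Global 1-based number at which each chromosome starts.
STARTS = []
_offset = 1
for _name, _length in CHROMS:
    STARTS.append(_offset)
    _offset += _length
TOTAL = _offset - 1


def locate(n):
    """Map a global 1-based number to (chromosome, 1-based position)."""
    if not 1 <= n <= TOTAL:
        raise ValueError("number outside the genome: %r" % (n,))
    i = bisect_right(STARTS, n) - 1
    return CHROMS[i][0], n - STARTS[i] + 1


def bases_adjacent(a, b):
    """True only if a and b are consecutive AND on the same chromosome."""
    return abs(a - b) == 1 and locate(a)[0] == locate(b)[0]
```

With these made-up lengths, 9 and 10 are both on Chr1 and so are
neighbors, but 10 is the last base of Chr1 and 11 is the first base of
Chr2, so bases_adjacent(10, 11) is False even though the numbers are
consecutive.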

>   If those top-level regions are contigs or ESTs (which I believe is 
> allowed by the spec), then positions on one of them can be related to 
> positions on others through various transforms.

Those are allowed.  Will people use them?  What advantage is there
to having these be a special category instead of a feature?

> You represent gaps with <FEATURE> tag parent-child relationships, and 
> going backwards by specifying "+1" strand on one contig and "-1" 
> strand on the other.

Something like this?  (Yes, this is hand-wavy)  Here's a <FEATURE>
(and note, this is NOT a <REGION>) with two subfeatures, one on the
forward strand and one on the reverse.

   <feature id="A">
     <part id="A.1"/>
     <part id="A.2"/>
   </feature>

   <feature id="A.1">
     <parent id="A" />
     <LOC pos="region/Chr3/1271:2917:1" />
   </feature>

   <feature id="A.2">
     <parent id="A" />
     <LOC pos="region/Chr3/5541:5523:-1" />
   </feature>

This I understand just fine.  I don't understand why the
positions are given in /region space instead of either:

   - directly to /sequence space, eg

   <feature id="A.2">
     <parent id="A" />
     <LOC seq="sequence/Chr3/5541:5523:-1" />
   </feature>
     ...

-or-

   - point to a feature of type 'region' which provides the
        region coordinates

   <feature id="A.2">
     <parent id="A" />
     <LOC on="feature/contig1" />
   </feature>
      ...
   <feature id="contig1" type="region">
     <LOC seq="Chr3/5541:5523:-1" />
   </feature>

(Again, hand-wavy.  I think best when looking at data and code.)
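
In that spirit, here is a minimal Python sketch of how a client might
follow the second alternative.  Everything in it is an assumption: the
<features> wrapper element, the "on" and "seq" attributes and the
"id/start:end:strand" syntax are just my hand-wavy examples from above,
not the DAS/2 spec, and resolve and parse_span are names I made up.

```python
import xml.etree.ElementTree as ET

# The two hand-wavy elements from above, wrapped in a made-up root
# element so that they form a well-formed document.
DOC = """
<features>
  <feature id="A.2">
    <parent id="A" />
    <LOC on="feature/contig1" />
  </feature>
  <feature id="contig1" type="region">
    <LOC seq="Chr3/5541:5523:-1" />
  </feature>
</features>
"""


def parse_span(text):
    """Split 'Chr3/5541:5523:-1' into (seqid, start, end, strand)."""
    seqid, span = text.rsplit("/", 1)
    start, end, strand = (int(field) for field in span.split(":"))
    return seqid, start, end, strand


def resolve(root, feature_id, max_hops=10):
    """Follow LOC 'on' references until a LOC with 'seq' is found."""
    by_id = {f.get("id"): f for f in root.iter("feature")}
    current = feature_id
    for _ in range(max_hops):
        loc = by_id[current].find("LOC")
        if loc is None:
            raise ValueError("feature %r has no LOC" % (current,))
        if loc.get("seq") is not None:
            return parse_span(loc.get("seq"))
        target = loc.get("on")
        if target is None:
            raise ValueError("LOC of %r has neither seq nor on" % (current,))
        # 'on' looks like "feature/<id>"; keep only the id part.
        current = target.split("/", 1)[1]
    raise ValueError("too many hops; possible reference cycle")


root = ET.fromstring(DOC)
location = resolve(root, "A.2")
```

The cost of the indirection is the extra lookup per feature; the gain
is that the contig's placement is stated once, on the "region" feature,
rather than repeated in every feature located on it.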

> The spec does not require a DAS/2 server to know how to perform 
> transformations from one coordinate system to another, but your 
> statement "there really is only one reference frame for the entire 
> genome sequence" is wrong as I understand it.  There is one coordinate 
> axis for *each* top-level region.

Understood.

My questions, to summarize, are:
   - why do we need a /region space when we can
       1. point directly to a sequence (for chromosome regions) and/or
       2. point to a "contig" or "assembly" or "region" feature type
               (for other regions)

   - When would someone have top-level regions which mix more than
      one of contigs, ESTs and chromosomes?  Especially given that
      this is the genome spec, so chromosome-level info is known, at
      least enough for a rough assembly.

In other words, what are regions for?

					Andrew
					dalke at dalkescientific.com