What is /region for? (was Re: [DAS2] DAS intro)

Wed Nov 30 01:26:29 UTC 2005

(Changed the Subject line slightly to be a bit clearer. I hope.)

On Nov 30, 2005, at 1:37 AM, Ed Erwin wrote:
> Andrew Dalke wrote:
>> My questions, to summarize, are:
>>   - why do we need a /region space when we can
>>       1. point directly to a sequence (for chromosome regions) and/or
>>       2. point to a "contig" or "assembly" or "region" feature type
>>               (for other regions)
>
> The way I understand it, that is what region is for: to point directly 
> to a location on a sequence and/or contig.

Am I not asking the question correctly?  Am I missing the
obvious?  Been known to happen before!

I know what regions are.  I don't know why they are in
a distinct /region subtree.

I'm happy - enthusiastic - ecstatic - that there are different
ways to identify certain regions.  I fully accept that they
are in use every day and widely understood.

Why are they special enough to get their own /region subtree?
Why can't they be features?

Here's my proposal.  Leaf node parts of a <feature> always point
to a /sequence and optionally point to one or more /feature
elements which are of type "region".  (Or some other part of
SOFA - perhaps assembly-component?)

Want to know where the feature is on a given "region" feature?
Then look up the region to find its /sequence location.  Use
these two /sequence locations to get the location in the region.
Both /sequence locations are in the same "coordinate space" of
"identifier + start/end offset"
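
To make that arithmetic concrete, here is a minimal sketch in
Python.  The names (SeqLocation, location_within_region) are made
up for illustration and are not from the spec.  It also assumes
1-based inclusive coordinates, forward strand only, and a feature
that lies entirely inside the region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeqLocation:
    """A location in /sequence space: identifier + start/end offset.

    Hypothetical type; coordinates are assumed 1-based and inclusive.
    """
    seqid: str
    start: int
    end: int


def location_within_region(feature: SeqLocation, region: SeqLocation):
    """Return the feature's (start, end) relative to the region's start.

    Both arguments are /sequence locations, so they can be compared
    directly as long as they name the same sequence.
    """
    if feature.seqid != region.seqid:
        raise ValueError("feature and region are on different sequences")
    if feature.start < region.start or feature.end > region.end:
        raise ValueError("feature is not contained in the region")
    offset = region.start - 1  # position 1 of the region maps to region.start
    return (feature.start - offset, feature.end - offset)


# A region feature covering Chr1:100000..200000 and a feature at
# Chr1:100500..100900 (made-up numbers).
region = SeqLocation("Chr1", 100000, 200000)
feature = SeqLocation("Chr1", 100500, 100900)
print(location_within_region(feature, region))  # (501, 901)
```

Reverse-strand regions and features that only partly overlap a
region would need more care than this sketch gives them.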

BTW, if regions are a type of feature then you can search for
them.  E.g., search for all top-level regions in the range 100000
to 2000000.  Can't do that with the /region container.  Can
if the region data is in the /feature container.
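
A rough sketch of what I mean, again in Python.  The Feature
record and the find_features function are hypothetical stand-ins
for whatever the /feature query ends up looking like; the point
is only that a region stored as a feature falls out of the same
range-and-type filter as everything else.  Coordinates are assumed
1-based inclusive, and I've left out any notion of "top-level".

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Feature:
    """Hypothetical feature record with a /sequence location."""
    feature_id: str
    feature_type: str
    seqid: str
    start: int
    end: int


def find_features(features: Iterable[Feature], seqid: str,
                  start: int, end: int,
                  feature_type: Optional[str] = None) -> List[Feature]:
    """Return features on seqid overlapping start..end, optionally by type."""
    return [
        f for f in features
        if f.seqid == seqid
        and f.start <= end and f.end >= start  # inclusive overlap test
        and (feature_type is None or f.feature_type == feature_type)
    ]


# Made-up data: two regions and a gene, all stored as features.
features = [
    Feature("r1", "region", "Chr1", 150000, 400000),
    Feature("r2", "region", "Chr1", 2500000, 3000000),
    Feature("g1", "gene", "Chr1", 160000, 170000),
]
hits = find_features(features, "Chr1", 100000, 2000000,
                     feature_type="region")
print([f.feature_id for f in hits])  # ['r1']
```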

>>   - When would someone have regions which have more than one of
>>      contigs, ESTs and chromosomes?  Especially given that this
>>      is the genome spec, so chromosome-level info is known, at
>>      least enough for a rough assembly.
>
> I think they do it mainly 1) when the assembly is incomplete or 2) to 
> preserve annotations from the past when the assembly was incomplete. 
> There could be more reasons.
>
> Here is an example of a DAS/1 server that contains both chromosomes 
> and "other" short sequences as entry points:

Okay, I'm fine with that.  Thanks.

Is a goal of DAS to support incomplete genomes?

Note, btw, that the /sequence subtree does not need to contain
only chromosomes.  From the spec:

   seqid is the sequence ID, and can correspond to an assembled
   chromosome, a contig, a clone, or any other accessionable
   chunk of sequence.

Hence for incomplete genomes, put the sequence data as
best you can under /sequence and have the /feature subtree
point to it.

>> In other words, what are regions for?

Still don't understand the need for a /region namespace.
Repeat: I understand regions, I just don't see why they
go in their own subtree and aren't part of some other data chunk.

Please, someone sketch out some example with hand-waving
XML that shows how having a /region is the appropriate solution.
That's what I'm worried about now - the representation in XML.

					Andrew
					dalke at dalkescientific.com