[DAS2] DAS intro

Suzanna Lewis suzi at fruitfly.org
Sun Nov 27 01:24:07 UTC 2005


Let's add this to the agenda for Monday morning. Hopefully that will be 
faster than via e-mail.

On Nov 26, 2005, at 5:20 PM, Andrew Dalke wrote:

> Suzi:
>> so there seem to be 2 questions. it would be good to have both in the 
>> intro, but only as long as the description can be clearly stated in 
>> just a sentence or two. If it takes more then it is clearly something 
>> that requires a fuller description outside of the intro.
>
> Agreed.
>
>> I'll try to give my understanding (but goodness knows I am peering 
>> through different lenses). I don't think in terms of the spec at all, 
>> just the information that needs to be conveyed.
>>
>> #1 "reference frame" =========================================
>>
>> "reference frame", is (to my mind) "reference sequence". at least, 
>> that is what i've always called it.
>
>
>> First, accuracy has nothing at all to do with it, so we don't want 
>> the sentence in there.
>
> I'm fine with that.  I've found it better to declare my ignorance early
> than to keep it hidden.
>
>> Second, the region of sequence that is returned is nothing more than 
>> that. Think of it as a special type of feature. This is what makes a 
>> transformation possible from one coordinate-system to another (by 
>> adding the correct offsets)
>
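A minimal sketch of the offset idea quoted above, assuming 1-based inclusive coordinates and forward-strand placements only. The `Placement` record, the function names, and the example placement of ctg2 on Chr3 are hypothetical illustrations, not anything defined by the DAS2 spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical record: where a contig sits on its chromosome."""
    contig: str
    chromosome: str
    offset: int  # chromosome coordinate of the contig's first base (1-based)


def contig_to_chromosome(p: Placement, start: int, end: int) -> tuple:
    """Map a 1-based inclusive contig range onto chromosome coordinates."""
    shift = p.offset - 1
    return start + shift, end + shift


def chromosome_to_contig(p: Placement, start: int, end: int) -> tuple:
    """Inverse mapping: subtract the same shift."""
    shift = p.offset - 1
    return start - shift, end - shift


# Made-up placement: ctg2 begins at base 1001 of Chr3.
placement = Placement(contig="ctg2", chromosome="Chr3", offset=1001)
print(contig_to_chromosome(placement, 271, 507))
```

Reverse-strand placements would also need the range flipped, which this sketch leaves out.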
> I can think of it as a feature just fine.  But then shouldn't each region
> also be a feature?  Why wouldn't all contigs be visible as an annotation?
>
> Contigs are in SOFA as
>
>     @is_a@contig ; SO:0000149 @is_a@ assembly_component ;
>         SO:0000143 @part_of@ supercontig ; SO:0000148
>
> What advantage is there to breaking this feature out as a "/region"?
>
> One that I can see is that the reference server provides the regions
> while the annotation server provides the other features.  But if
> that's the case we could have the reference server also provide the
> regions as features, and the annotation server makes references to
> those features rather than to regions.
>
> That is, in the current scheme we have:
>
> <feature> has 0 or more <loc> elements, where the 'pos' attribute
>    links to region + start/stop range and the optional 'seq' attribute
>    links to the sequence range, as in:
>
>    <LOC  pos="region/Chr3/1271:1507:1" seq="sequence/Chr3/1271:1507:1"/>
>
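For illustration, a rough sketch of splitting a 'pos' or 'seq' value such as the one quoted above into its parts. It assumes the value always has the shape kind/name/start:stop:third-field, and that the third field is a strand indicator; both are assumptions made here, and `Location` and `parse_location` are hypothetical names.

```python
from typing import NamedTuple


class Location(NamedTuple):
    kind: str    # e.g. "region", "sequence" or "feature"
    name: str    # e.g. "Chr3"
    start: int
    stop: int
    strand: int  # assumed meaning of the third colon-separated field


def parse_location(value: str) -> Location:
    """Split 'kind/name/start:stop:strand' into typed fields."""
    kind, name, span = value.split("/")
    start, stop, strand = (int(part) for part in span.split(":"))
    return Location(kind, name, start, stop, strand)


loc = parse_location("region/Chr3/1271:1507:1")
print(loc.name, loc.start, loc.stop)
```

A value without a range, such as "feature/AB1234", does not fit this shape and would raise a ValueError here.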
>
> <region> is only a link to the sequence and a length, as in:
>
>    <REGION id="../sequence/ctg2/100:200" length="100" name="ABCDE" />
>
>
> One alternate possibility is to change that so "pos" points to a
> /feature (instead of a /region) and have features for each contig or
> other assembly component.  The result would look like:
>
>    <LOC  pos="feature/AB1234/671:907:1" seq="sequence/Chr3/1271:1507:1"/>
>
>   <FEATURE id="feature/AB1234" type="ABCDE_type"> ...
>
> Doing this, however, means that all features must support subranges.
>
>
> As an alternate solution without ranges, use
>
>    <LOC  pos="feature/AB1234" seq="sequence/Chr3/1271:1507:1"/>
>
> and then look up the sequence coordinates of feature/AB1234 to
> figure out where it starts/stops.
>
>
> The other advantage to a region is you can ask for the assembly
> via the 'agp' format.  But because of the existing support for
> formats which are only valid for some features, you can do that by asking
> for, say, all assembly_component features (via the feature filter) and
> getting the results back in 'agp' format.
>
>> Third, just think of "reference sequence" as a coordinate system. One 
>> can have the exact same feature and indicate that: on 
>> coordinate-system-A this feature starts and ends here, and on 
>> coordinate-system-B it starts and ends there. Thus a feature's 
>> coordinates may be given both on a chromosome, and on a contig, and 
>> on any other coordinate-system that can be derived through a 
>> transform from these.
>
> I believe I understand this.  There really is only one reference frame for
> the entire genome sequence, for a given assembly, and all other coordinate
> systems are a fixed and definite offset of that single reference frame.
> I believe this is called the golden path?
>
> My reference to accuracy is because I figured that given two features
> A and B on an assembly component X then the fuzziness in the relative
> distance between A and B is small if X is also small.  That is, smaller
> terms are less likely to have changes as the golden path changes.
>
>
>>  So you could change the sentence below to read "A reference server 
>> may supply features where the locations (start and end) are relative 
>> to either contigs, some other arbitrary region, or to the entire 
>> chromosome."
>
> Why not always supply it relative to the chromosome coordinates?  The spec
> now allows that as an optional field.  I can't figure out why you would
> want to do otherwise.
>
> Is it because sometimes it's easier to work with, say, a large number of
> contig reference frames than with one large reference frame?  Does that
> mean we shift the complexity of coordinate translation from the data
> provider to the data consumer?  (Making it easier to generate data than
> to consume data.)
>
>
>> This one is perhaps too subtle for the introduction, but if we decide 
>> to include it then I think it should first be phrased in terms of the 
>> problem (biological sampling) and then in terms of the solution 
>> (multiple parents).
>
> Oh, definitely.  It's some place where I just don't have the domain
> knowledge to explain it or even come up with examples.
>
> 					Andrew
> 					dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2



