[DAS2] best practices / DAS2 format examples
Andrew Dalke
dalke at dalkescientific.com
Mon Sep 11 17:52:35 UTC 2006
das2-teleconf-2006-03-16.txt
> [A] Lincoln will provide use cases/examples of these features
> scenarios:
> - three or greater hierarchy features
> - multiple parents
> - alignments
I really would like some real-world examples of these. I don't know
enough to make decent examples for the documentation and I think it
would be very useful so others can see how to model existing data
in DAS2 XML.
I looked at GFF3 examples to find existing properties which must be
storable in a DAS2 feature document. Here are two example lines:

ID=FBti0020396;Name=Rt1c{}1472;Dbxref=FlyBase+Annotation+IDs:TE20396,FlyBase:FBti0020396;cyto_range=102A1-102A1;gbunit=AE003845;synonym=TE20396;synonym_2nd=Rt1c{}1472

ID=FBgn0004859;Name=ci;Dbxref=FlyBase+Annotation+IDs:CG2125,FlyBase:FBan0002125,FlyBase:FBgn0004859;cyto_range=102A1-102A3;dbxref_2nd=FlyBase:FBgn0000314,FlyBase:FBgn0000315,FlyBase:FBgn0010154,FlyBase:FBgn0010155,FlyBase:FBgn0017411,FlyBase:FBgn0019831;gbunit=AE003845;synonym_2nd=Ce,Ci,CI,ci155,ciD,ci-D,CiD,CID,ci<up>D</up>,Ci<up>D</up>,Cubitus+interruptus,cubitus-interruptus-Dominant,l(4)102ABc,l(4)13,l(4)17

I do not know this domain well enough. I do not know how "cyto_range"
or "gbunit" should be stored in DAS2 XML. I don't know the difference
between dbxref and dbxref_2nd, nor can I find documentation on these
properties.
Looking around, I came across the names
cyto_range Dbxref dbxref_2nd Name Parent species gbunit Alias
but I don't know how those are best modeled in DAS2. For example, is
species redundant given that we know that from the reference sequence?
I want someone to be able to go to DAS and easily figure out how to
convert existing data models into DAS's model.
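
One way to survey which attribute names actually occur in a GFF3 file is to split the attribute column and tally the keys. This is a minimal sketch, not part of any DAS2 tool: parse_attributes and count_keys are hypothetical helper names, and the code only splits on ";", "=" and ",", without decoding escapes such as "+" or interpreting what the values mean.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column into {key: [values]}.

    Values are split on commas only; no unescaping is attempted.
    """
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        attrs[key] = value.split(",")
    return attrs


def count_keys(attribute_columns):
    """Count how many records use each attribute name."""
    counts = Counter()
    for column in attribute_columns:
        counts.update(parse_attributes(column).keys())
    return counts


# First example line from the message, rejoined into one string.
example = ("ID=FBti0020396;Name=Rt1c{}1472;"
           "Dbxref=FlyBase+Annotation+IDs:TE20396,FlyBase:FBti0020396;"
           "cyto_range=102A1-102A1;gbunit=AE003845;synonym=TE20396;"
           "synonym_2nd=Rt1c{}1472")

attrs = parse_attributes(example)
print(sorted(attrs))
print(attrs["Dbxref"])
```

A tally like this only says which names need a home in DAS2; it does not answer how each one should be modeled.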
Here is an example of a real-world GFF3 complex annotation, which we're
calling a "feature group" in DAS2. The top-level is a gene. It has one
child which is an mRNA. The mRNA has children of CDS, exon, protein,
and intron. I've added newlines for readability.

4 . gene 22335 23205 . - .
    ID=FBgn0052013;Name=CG32013;
    Dbxref=FlyBase+Annotation+IDs:CG32013,FlyBase:FBan0032013,FlyBase:FBgn0052013;
    cyto_range=101F1-101F1;gbunit=AE003845
4 . mRNA 22335 23205 . - .
    ID=FBtr0089183;Name=CG32013-RA;Parent=FBgn0052013;
    Dbxref=FlyBase+Annotation+IDs:CG32013-RA,FlyBase:FBtr0089183;
    cyto_range=101F1-101F1
4 . CDS 22335 22528 . - .
    Parent=FBtr0089183;Name=CG32013-cds;
    Dbxref=FlyBase+Annotation+IDs:CG32013-RA
4 . exon 22335 22528 . - .
    Parent=FBtr0089183
4 . protein 22338 23205 . - .
    ID=FBpp0088247;Name=CG32013-PA;Parent=FBtr0089183;
    Dbxref=FlyBase+Annotation+IDs:CG32013-PA,FlyBase:FBpp0088247,GB_protein:AAN06536.1,FlyBase+Annotation+IDs:CG32013-RA
4 . intron 22529 22616 . - .
    Parent=FBtr0089183;Name=CG32013-in
4 . CDS 22617 23205 . - .
    Parent=FBtr0089183;Name=CG32013-cds;
    Dbxref=FlyBase+Annotation+IDs:CG32013-RA
4 . exon 22617 23205 . - .
    Parent=FBtr0089183

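
The gene/mRNA/children structure can be recovered from the Parent attributes alone. The sketch below is only an illustration under stated assumptions: the records are written tab-separated (as GFF3 files are), the attribute column is trimmed to ID and Parent, and parse_record, group_children and print_tree are hypothetical names. A comma-separated Parent value is treated as multiple parents.

```python
from collections import defaultdict

# Tab-separated records for the gene above; attributes trimmed to ID/Parent.
GFF3_LINES = [
    "4\t.\tgene\t22335\t23205\t.\t-\t.\tID=FBgn0052013",
    "4\t.\tmRNA\t22335\t23205\t.\t-\t.\tID=FBtr0089183;Parent=FBgn0052013",
    "4\t.\tCDS\t22335\t22528\t.\t-\t.\tParent=FBtr0089183",
    "4\t.\texon\t22335\t22528\t.\t-\t.\tParent=FBtr0089183",
    "4\t.\tprotein\t22338\t23205\t.\t-\t.\tID=FBpp0088247;Parent=FBtr0089183",
    "4\t.\tintron\t22529\t22616\t.\t-\t.\tParent=FBtr0089183",
    "4\t.\tCDS\t22617\t23205\t.\t-\t.\tParent=FBtr0089183",
    "4\t.\texon\t22617\t23205\t.\t-\t.\tParent=FBtr0089183",
]


def parse_record(line):
    """Pull type, range, ID and parent IDs out of one GFF3 line."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(field.split("=", 1) for field in cols[8].split(";") if field)
    parents = attrs["Parent"].split(",") if "Parent" in attrs else []
    return {"type": cols[2], "start": int(cols[3]), "end": int(cols[4]),
            "id": attrs.get("ID"), "parents": parents}


def group_children(records):
    """Map each parent ID to the records that name it in Parent."""
    children = defaultdict(list)
    for record in records:
        for parent_id in record["parents"]:
            children[parent_id].append(record)
    return children


def print_tree(record, children, depth=0):
    """Print a feature and, indented below it, everything that points to it."""
    print("  " * depth + "%s %d-%d" % (record["type"], record["start"], record["end"]))
    if record["id"] is not None:
        for child in children.get(record["id"], []):
            print_tree(child, children, depth + 1)


records = [parse_record(line) for line in GFF3_LINES]
children = group_children(records)
for record in records:
    if not record["parents"]:  # top-level features only
        print_tree(record, children)
```

In this sketch a feature with several parents would simply be printed once under each of them.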
The direct conversion to DAS2 XML, the way I've been doing it, starts
by defining a TYPES document like this (the das-private: identifiers
are created upon server upload). Note that I'm storing the GFF3 fields
in PROP elements so I can easily figure out which DAS2 types correspond
to the GFF3 types (a unique GFF3 type is the pair (type, source)):

<TYPES>
<TYPE uri="das-private:T8">
<PROP key="gff3-type" value="gene" />
<PROP key="gff3-source" value="" />
</TYPE>
<TYPE uri="das-private:T9">
<PROP key="gff3-type" value="mRNA" />
<PROP key="gff3-source" value="" />
</TYPE>
<TYPE uri="das-private:T10">
<PROP key="gff3-type" value="exon" />
<PROP key="gff3-source" value="" />
</TYPE>
</TYPES>
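
The bookkeeping behind such a TYPES document can be sketched as follows. This is an assumption-laden illustration, not the server's code: assign_type_uris and types_document are hypothetical names, the T-numbers are just a local counter (the real das-private: identifiers are created upon server upload), and a GFF3 source of "." is written as an empty value to mirror the example above.

```python
from xml.sax.saxutils import quoteattr


def assign_type_uris(pairs, first_id=1):
    """Give each distinct (type, source) pair its own das-private: URI."""
    uris = {}
    for pair in pairs:
        if pair not in uris:
            uris[pair] = "das-private:T%d" % (first_id + len(uris))
    return uris


def types_document(uris):
    """Render the mapping as a TYPES document with gff3-* PROP elements."""
    lines = ["<TYPES>"]
    for (gff3_type, source), uri in uris.items():
        # Assumption: "." (no source given) is stored as an empty string.
        source_value = "" if source == "." else source
        lines.append("<TYPE uri=%s>" % quoteattr(uri))
        lines.append('<PROP key="gff3-type" value=%s />' % quoteattr(gff3_type))
        lines.append('<PROP key="gff3-source" value=%s />' % quoteattr(source_value))
        lines.append("</TYPE>")
    lines.append("</TYPES>")
    return "\n".join(lines)


# (type, source) pairs in file order; the repeated gene pair reuses its URI.
pairs = [("gene", "."), ("mRNA", "."), ("gene", "."), ("exon", ".")]
uris = assign_type_uris(pairs, first_id=8)
print(types_document(uris))
```

Keying the mapping on the pair rather than on the type name alone keeps two sources that both emit, say, "exon" from collapsing into one DAS2 type.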
Given the types, the features document looks like this:

<FEATURES>
<FEATURE type="das-private:T8" uri="das-private:F233" title="CG32013">
<LOC start="22334" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
end="23205" strand="-1"/>
<PART uri="das-private:F232"/>
</FEATURE>
<FEATURE type="das-private:T9" uri="das-private:F232"
title="CG32013-RA">
<LOC start="22334" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
end="23205" strand="-1"/>
<PARENT uri="das-private:F233"/>
<PART uri="das-private:F234"/>
<PART uri="das-private:F235"/>
<PART uri="das-private:F236"/>
<PART uri="das-private:F237"/>
<PART uri="das-private:F238"/>
<PART uri="das-private:F239"/>
</FEATURE>
<FEATURE type="das-private:T25" uri="das-private:F234"
title="CG32013-cds">
<LOC start="22334" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
end="22528" strand="-1"/>
<PARENT uri="das-private:F232"/>
</FEATURE>
<FEATURE type="das-private:T10" uri="das-private:F235">
<LOC start="22334" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
end="22528" strand="-1"/>
<PARENT uri="das-private:F232"/>
</FEATURE>
<FEATURE type="das-private:T24" uri="das-private:F236"
title="CG32013-PA">
<LOC start="22337" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
end="23205" strand="-1"/>
<PARENT uri="das-private:F232"/>
</FEATURE>
<FEATURE type="das-private:T27" uri="das-private:F237"
title="CG32013-in">
<LOC start="22528" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
end="22616" strand="-1"/>
<PARENT uri="das-private:F232"/>
</FEATURE>
<FEATURE type="das-private:T25" uri="das-private:F238"
title="CG32013-cds">
<LOC start="22616" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
end="23205" strand="-1"/>
<PARENT uri="das-private:F232"/>
</FEATURE>
<FEATURE type="das-private:T10" uri="das-private:F239">
<LOC start="22616" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
end="23205" strand="-1"/>
<PARENT uri="das-private:F232"/>
</FEATURE>
</FEATURES>

Note the change in start position: GFF3 is a "start with 1" numbering
system while DAS2 is a "start with 0" system. Note also that I've used
the Name property from GFF3 to populate the title field in DAS2. While
I have ideas on what to do with the rest (e.g., populate the dbxref
DAS2 element), I don't know what to do with all of the fields and would
like advice.
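
The coordinate shift can be written down as a small helper. This is a sketch under assumptions taken from the example above: only the start moves down by one and the end stays as it is, "-" becomes strand="-1", and "+" is assumed to become "1" (the example shows only the minus strand). gff3_to_das2_range and loc_element are hypothetical names.

```python
from xml.sax.saxutils import quoteattr

# "-" -> "-1" is taken from the example; "+" -> "1" is an assumption.
STRAND = {"-": "-1", "+": "1"}


def gff3_to_das2_range(start, end):
    """Convert a GFF3 range (counted from 1) to the DAS2 range used above."""
    if start < 1 or end < start:
        raise ValueError("not a valid GFF3 range: %r-%r" % (start, end))
    return start - 1, end


def loc_element(segment, gff3_start, gff3_end, gff3_strand):
    """Build a LOC element from GFF3 columns 4, 5 and 7."""
    start, end = gff3_to_das2_range(gff3_start, gff3_end)
    return "<LOC start=%s segment=%s end=%s strand=%s/>" % (
        quoteattr(str(start)), quoteattr(segment),
        quoteattr(str(end)), quoteattr(STRAND[gff3_strand]))


segment = "http://cgi.biodas.org:8081/seq/fly_43/4"
print(loc_element(segment, 22335, 23205, "-"))   # the gene
print(loc_element(segment, 22529, 22616, "-"))   # the intron
```

With this convention the feature length is simply end minus start, which is one reason to shift only the start.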
Andrew
dalke at dalkescientific.com