[DAS2] best practices / DAS2 format examples

Andrew Dalke dalke at dalkescientific.com
Mon Sep 11 17:52:35 UTC 2006


das2-teleconf-2006-03-16.txt
> [A] Lincoln will provide use cases/examples of these features  
> scenarios:
> - three or greater hierarchy features
> - multiple parents
> - alignments

I really would like some real-world examples of these.  I don't know
enough to make decent examples for the documentation and I think it
would be very useful so others can see how to model existing data
in DAS2 XML.

I looked at GFF3 examples to find existing properties which must be
storable in a DAS2 feature document.  Here are the attribute columns
from two example lines:

ID=FBti0020396;Name=Rt1c{}1472;Dbxref=FlyBase+Annotation+IDs:TE20396,FlyBase:FBti0020396;cyto_range=102A1-102A1;gbunit=AE003845;synonym=TE20396;synonym_2nd=Rt1c{}1472

ID=FBgn0004859;Name=ci;Dbxref=FlyBase+Annotation+IDs:CG2125,FlyBase:FBan0002125,FlyBase:FBgn0004859;cyto_range=102A1-102A3;dbxref_2nd=FlyBase:FBgn0000314,FlyBase:FBgn0000315,FlyBase:FBgn0010154,FlyBase:FBgn0010155,FlyBase:FBgn0017411,FlyBase:FBgn0019831;gbunit=AE003845;synonym_2nd=Ce,Ci,CI,ci155,ciD,ci-D,CiD,CID,ci<up>D</up>,Ci<up>D</up>,Cubitus+interruptus,cubitus-interruptus-Dominant,l(4)102ABc,l(4)13,l(4)17

I do not know this domain well enough.  I do not know how "cyto_range"
or "gbunit" should be stored in DAS2 XML.  I don't know the difference
between dbxref and dbxref_2nd, nor can I find documentation on these
properties.  Looking around I came across these names:

   cyto_range Dbxref dbxref_2nd Name Parent species gbunit Alias

but I don't know how those are best modeled in GFF3.  For example, is
species redundant given that we know that from the reference sequence?

I want someone to be able to go to DAS and easily figure out how to
convert existing data models into DAS's model.
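
As a starting point for listing which properties a GFF3 file actually
uses, here is a minimal sketch that splits an attribute column into keys
and value lists.  The function name parse_gff3_attributes is made up for
this example.  It assumes ";" separates fields, "," separates values, and
values are %-escaped; it leaves "+" untouched, since I don't know whether
these files use "+" to mean a space.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column into {key: [values]} (hypothetical helper)."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        key, _, value = field.partition("=")
        # Split on ',' before unescaping so %-escaped commas stay inside a value.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes


# Attribute column of the first example above, with the mail wrapping removed.
line = ("ID=FBti0020396;Name=Rt1c{}1472;"
        "Dbxref=FlyBase+Annotation+IDs:TE20396,FlyBase:FBti0020396;"
        "cyto_range=102A1-102A1;gbunit=AE003845;"
        "synonym=TE20396;synonym_2nd=Rt1c{}1472")
attrs = parse_gff3_attributes(line)
for key in sorted(attrs):
    print(key, attrs[key])
```

Running this over a whole file and collecting the keys would give the
full list of property names that need a home in the DAS2 model.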


Here is an example of a real-world GFF3 complex annotation, which we're
calling a "feature group" in DAS2.  The top level is a gene.  It has one
child, which is an mRNA.  The mRNA has CDS, exon, protein, and intron
children.  I've added newlines for readability.

4       .       gene    22335   23205   .       -       .
ID=FBgn0052013;Name=CG32013;
Dbxref=FlyBase+Annotation+IDs:CG32013,FlyBase:FBan0032013,FlyBase:FBgn0052013;
cyto_range=101F1-101F1;gbunit=AE003845

4       .       mRNA    22335   23205   .       -       .
ID=FBtr0089183;Name=CG32013-RA;Parent=FBgn0052013;
Dbxref=FlyBase+Annotation+IDs:CG32013-RA,FlyBase:FBtr0089183;
cyto_range=101F1-101F1

4       .       CDS     22335   22528   .       -       .
Parent=FBtr0089183;Name=CG32013-cds;
Dbxref=FlyBase+Annotation+IDs:CG32013-RA

4       .       exon    22335   22528   .       -       .
Parent=FBtr0089183

4       .       protein 22338   23205   .       -       .
ID=FBpp0088247;Name=CG32013-PA;Parent=FBtr0089183;
Dbxref=FlyBase+Annotation+IDs:CG32013-PA,FlyBase:FBpp0088247,
GB_protein:AAN06536.1,FlyBase+Annotation+IDs:CG32013-RA

4       .       intron  22529   22616   .       -       .
Parent=FBtr0089183;Name=CG32013-in

4       .       CDS     22617   23205   .       -       .
Parent=FBtr0089183;Name=CG32013-cds;
Dbxref=FlyBase+Annotation+IDs:CG32013-RA

4       .       exon    22617   23205   .       -       .
Parent=FBtr0089183
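
The parent/child structure of this feature group can be recovered from
the ID and Parent attributes.  The sketch below is only an illustration:
the records list is typed in by hand from the example above, the function
name group_children is made up, and it assumes each record has at most
one Parent (so it does not cover the "multiple parents" case).

```python
from collections import defaultdict

# (type, start, end, ID or None, Parent or None), copied from the GFF3 records above.
records = [
    ("gene", 22335, 23205, "FBgn0052013", None),
    ("mRNA", 22335, 23205, "FBtr0089183", "FBgn0052013"),
    ("CDS", 22335, 22528, None, "FBtr0089183"),
    ("exon", 22335, 22528, None, "FBtr0089183"),
    ("protein", 22338, 23205, "FBpp0088247", "FBtr0089183"),
    ("intron", 22529, 22616, None, "FBtr0089183"),
    ("CDS", 22617, 23205, None, "FBtr0089183"),
    ("exon", 22617, 23205, None, "FBtr0089183"),
]


def group_children(records):
    """Return (indices of top-level records, {parent ID: [child indices]})."""
    roots = []
    children = defaultdict(list)
    for index, (_type, _start, _end, _id, parent) in enumerate(records):
        if parent is None:
            roots.append(index)
        else:
            children[parent].append(index)
    return roots, children


roots, children = group_children(records)
for parent_id, child_indices in children.items():
    print(parent_id, [records[i][0] for i in child_indices])
```

The gene ends up with one child (the mRNA) and the mRNA with six, which
is the shape a converter has to reproduce with parent and part links.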

The direct conversion to DAS2 XML, the way I've been doing it, starts by
defining a TYPES document like this (the das-private: identifiers are
created upon server upload).  Note that I'm storing the GFF3 fields in
PROP elements so I can easily figure out which DAS2 types correspond
to which GFF3 types (a GFF3 type is uniquely identified by the pair
(type, source)).


<TYPES>
   <TYPE uri="das-private:T8">
     <PROP key="gff3-type" value="gene" />
     <PROP key="gff3-source" value="" />
   </TYPE>
   <TYPE uri="das-private:T9">
     <PROP key="gff3-type" value="mRNA" />
     <PROP key="gff3-source" value="" />
   </TYPE>
   <TYPE uri="das-private:T10">
     <PROP key="gff3-type" value="exon" />
     <PROP key="gff3-source" value="" />
   </TYPE>
</TYPES>
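
For illustration, a TYPES document of this shape could be generated from
the unique (type, source) pairs roughly as follows.  This is a sketch
under assumptions: build_types and its first_id parameter are made up,
and the sequential das-private:T numbers only imitate identifiers that
the server would really assign on upload.  A GFF3 source of "." is
written as an empty value, as in the document above.

```python
import xml.etree.ElementTree as ET


def build_types(pairs, first_id=1):
    """Build a TYPES element with one TYPE per unique (type, source) pair."""
    root = ET.Element("TYPES")
    uris = {}
    for gff3_type, source in pairs:
        key = (gff3_type, "" if source == "." else source)
        if key in uris:
            continue  # this pair already has a TYPE element
        # Placeholder identifier; a real server would pick its own.
        uris[key] = "das-private:T%d" % (first_id + len(uris))
        type_el = ET.SubElement(root, "TYPE", {"uri": uris[key]})
        ET.SubElement(type_el, "PROP", {"key": "gff3-type", "value": key[0]})
        ET.SubElement(type_el, "PROP", {"key": "gff3-source", "value": key[1]})
    return root, uris


pairs = [("gene", "."), ("mRNA", "."), ("exon", "."), ("exon", ".")]
root, uris = build_types(pairs, first_id=8)
print(ET.tostring(root, encoding="unicode"))
```

The returned uris mapping is what the feature conversion would consult
to fill in each FEATURE's type attribute.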

Given the types, the features document looks like this:


<FEATURES>

<FEATURE type="das-private:T8" uri="das-private:F233" title="CG32013">
  <LOC start="22334" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
       end="23205" strand="-1"/>
  <PART uri="das-private:F232"/>
</FEATURE>

<FEATURE type="das-private:T9" uri="das-private:F232" title="CG32013-RA">
  <LOC start="22334" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
       end="23205" strand="-1"/>
  <PARENT uri="das-private:F233"/>
  <PART uri="das-private:F234"/>
  <PART uri="das-private:F235"/>
  <PART uri="das-private:F236"/>
  <PART uri="das-private:F237"/>
  <PART uri="das-private:F238"/>
  <PART uri="das-private:F239"/>
</FEATURE>

<FEATURE type="das-private:T25" uri="das-private:F234" title="CG32013-cds">
  <LOC start="22334" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
       end="22528" strand="-1"/>
  <PARENT uri="das-private:F232"/>
</FEATURE>

<FEATURE type="das-private:T10" uri="das-private:F235">
  <LOC start="22334" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
       end="22528" strand="-1"/>
  <PARENT uri="das-private:F232"/>
</FEATURE>

<FEATURE type="das-private:T24" uri="das-private:F236" title="CG32013-PA">
  <LOC start="22337" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
       end="23205" strand="-1"/>
  <PARENT uri="das-private:F232"/>
</FEATURE>

<FEATURE type="das-private:T27" uri="das-private:F237" title="CG32013-in">
  <LOC start="22528" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
       end="22616" strand="-1"/>
  <PARENT uri="das-private:F232"/>
</FEATURE>

<FEATURE type="das-private:T25" uri="das-private:F238" title="CG32013-cds">
  <LOC start="22616" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
       end="23205" strand="-1"/>
  <PARENT uri="das-private:F232"/>
</FEATURE>

<FEATURE type="das-private:T10" uri="das-private:F239">
  <LOC start="22616" segment="http://cgi.biodas.org:8081/seq/fly_43/4"
       end="23205" strand="-1"/>
  <PARENT uri="das-private:F232"/>
</FEATURE>

</FEATURES>


Note the change in start position: GFF3 is a "start with 1" numbering
system while DAS2 is a "start with 0" one, so each start is one less than
in the GFF3 record while the end is unchanged.  Note also that I've used
the Name property from GFF3 to populate the title field in DAS2.  While
I have ideas on what to do with the rest (e.g., populate the dbxref DAS2
element), I don't know what to do with all of the fields and would like
advice.
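
Put together, the per-record conversion I've described looks roughly
like the sketch below.  It is not a complete converter: gff3_to_feature
is a made-up name, the type and feature URIs are passed in because the
server assigns them, the segment prefix is the one from my example, and
mapping "+" to "1" is my assumption (the example only shows "-" becoming
"-1").  Parent/part links and all attributes other than Name are left
out.

```python
import xml.etree.ElementTree as ET

SEGMENT_BASE = "http://cgi.biodas.org:8081/seq/fly_43/"  # prefix from the example
STRAND = {"-": "-1", "+": "1"}  # "-" is from the example; "+" is an assumption


def gff3_to_feature(line, type_uri, feature_uri):
    """Convert one tab-separated GFF3 record into a DAS2-style FEATURE element."""
    cols = line.rstrip("\n").split("\t")
    seqid, _source, _type, start, end, _score, strand = cols[:7]
    attrs = dict(field.split("=", 1) for field in cols[8].split(";") if field)

    feature = ET.Element("FEATURE", {"type": type_uri, "uri": feature_uri})
    if "Name" in attrs:
        feature.set("title", attrs["Name"])  # GFF3 Name becomes the DAS2 title

    loc = {
        "start": str(int(start) - 1),  # 1-based start becomes 0-based
        "segment": SEGMENT_BASE + seqid,
        "end": end,                    # end is carried over unchanged
    }
    if strand in STRAND:
        loc["strand"] = STRAND[strand]
    ET.SubElement(feature, "LOC", loc)
    return feature


line = "4\t.\tintron\t22529\t22616\t.\t-\t.\tParent=FBtr0089183;Name=CG32013-in"
feature = gff3_to_feature(line, "das-private:T27", "das-private:F237")
print(ET.tostring(feature, encoding="unicode"))
```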


					Andrew
					dalke at dalkescientific.com



