[Biojava-l] XFF: a simple genome annotation

Thomas Down td2@sanger.ac.uk
Mon, 4 Sep 2000 11:53:23 +0100


--BOKacYhQ+x31HxR3
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi...

For the last few days, I've been playing around with a new schema
for representing a hierarchical collection of feature objects
(e.g. a `promoter' feature could contain a set of sub-features
annotating individual transcription factor binding sites).  It
draws heavily on the Feature object model developed for the 
BioJava project, although obviously the XML itself isn't tied
to BioJava.

The idea is to just define a model for the `skeleton' of a
feature tree.  It is then possible to add extra information
by either making derived types of `feature', or by using
`detail' objects.  The aim is to give a format which can
be as simple as GFF, but which can be extended in arbitrary
ways while still allowing the data model to be properly
validated.
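
As a rough illustration of the skeleton described above, here is a
minimal Python sketch of the same idea.  It is not part of the attached
schema: the class and field names (Feature, StrandedFeature, Detail,
BindingSiteDetail, count_features) are hypothetical and only mirror the
element names used in the XML.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Detail:
    """Empty base for arbitrary extra data attached to a feature (hypothetical)."""


@dataclass
class Feature:
    """Skeleton of one node in the feature tree."""
    type: str
    source: str
    spans: List[Tuple[int, int]]  # location: one or more (start, stop) spans
    details: List[Detail] = field(default_factory=list)
    children: List["Feature"] = field(default_factory=list)


@dataclass
class StrandedFeature(Feature):
    """Derived feature type that adds a strand, '+' or '-'."""
    strand: str = "+"


@dataclass
class BindingSiteDetail(Detail):
    """Example detail object; the fields are illustrative only."""
    name: str = ""
    score: float = 0.0


def count_features(feature: Feature) -> int:
    """Count a feature plus all of its nested sub-features."""
    return 1 + sum(count_features(child) for child in feature.children)


# A promoter containing one binding-site sub-feature.
tata = Feature("binding_site", "Eponine 2.0", [(532, 540)],
               details=[BindingSiteDetail(name="TATA box", score=0.0467)])
promoter = StrandedFeature("promoter", "Eponine 2.0", [(230, 543)],
                           children=[tata], strand="+")
print(count_features(promoter))  # prints 2
```

The two extension routes show up as subclassing Feature (a derived
feature type) and subclassing Detail (a detail object); code that only
knows the skeleton can still walk the tree.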

I've attached the current version of my schema (in XSD format --
I'm afraid I'm no good at writing DTDs, and in any case the
inheritance model can't be represented in a DTD).  There's also
a quick test document.  Let me know what you think.

I'm actually very taken with the XSD schema language -- it works
well with all the various ways that namespaces can be used in
XML documents, and the derived types are potentially very powerful.
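
To make the namespace point concrete, here is a hedged Python sketch
that reads a cut-down XFF-style document using the standard
xml.etree.ElementTree module.  The skeleton elements live in the XFF
namespace and the detail elements in a separate one, so a reader can
walk the tree without knowing anything about the details.  The walk()
helper and the NS prefix map are assumptions made for this example,
and no schema validation is attempted.

```python
import xml.etree.ElementTree as ET

XFF = "http://www.bioxml.org/2000/xff"
EPONINE = "http://www.biojava.org/2000/eponine"
NS = {"xff": XFF, "eponine": EPONINE}  # prefix map used only by this sketch

DOC = """\
<featureSet xmlns="http://www.bioxml.org/2000/xff"
            xmlns:eponine="http://www.biojava.org/2000/eponine">
  <strandedFeature strand="+">
    <type>promoter</type>
    <source>Eponine 2.0</source>
    <location><span start="230" stop="543" /></location>
    <featureSet>
      <feature>
        <type>binding_site</type>
        <source>Eponine 2.0</source>
        <location><span start="532" stop="540" /></location>
        <details>
          <eponine:detail>
            <eponine:name>TATA box</eponine:name>
          </eponine:detail>
        </details>
      </feature>
    </featureSet>
  </strandedFeature>
</featureSet>
"""

# Element names that count as features, in ElementTree's {namespace}tag form.
FEATURE_TAGS = {f"{{{XFF}}}feature", f"{{{XFF}}}strandedFeature"}


def walk(feature_set, depth=0):
    """Yield (depth, type, spans, detail names) for each feature, depth first."""
    for elem in feature_set:
        if elem.tag not in FEATURE_TAGS:
            continue
        spans = [(int(s.get("start")), int(s.get("stop")))
                 for s in elem.findall("xff:location/xff:span", NS)]
        names = [n.text for n in
                 elem.findall("xff:details/eponine:detail/eponine:name", NS)]
        yield depth, elem.findtext("xff:type", namespaces=NS), spans, names
        nested = elem.find("xff:featureSet", NS)
        if nested is not None:
            yield from walk(nested, depth + 1)


rows = list(walk(ET.fromstring(DOC)))
for depth, ftype, spans, names in rows:
    print("  " * depth + f"{ftype} {spans} {names}")
```

A reader that does not recognise the eponine namespace could simply
skip the contents of each details element and still recover the full
feature tree.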

Note: right now I'm not trying to propose this as a new
specification (there are still issues to sort out).  But
I'd be interested to know what people think of the 
direction I'm taking.

Happy hacking,

   Thomas
-- 
One of the advantages of being disorderly is that one is
constantly making exciting discoveries.
                                       -- A. A. Milne

--BOKacYhQ+x31HxR3
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="xff.xsd"

<?xml version="1.0" ?>

<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema"
            xmlns="http://www.bioxml.org/2000/xff"
            targetNamespace="http://www.bioxml.org/2000/xff"
	    elementFormDefault="qualified"
	    attributeFormDefault="unqualified">

<xsd:annotation>
  <xsd:documentation>
    This is a simple schema for bioinformatics feature data.
    It is based heavily on the object model used by the BioJava
    project, which in turn was inspired partially by the GFF
    file format.

    This is not proposed as a standard format -- it is simply intended
    as a discussion piece.

    Thomas Down (td2@sanger.ac.uk)
  </xsd:documentation>
</xsd:annotation>

<xsd:element name="location">
  <xsd:annotation>
    <xsd:documentation>
      Element which indicates a location within a sequence.
      It consists of one or more spans.
    </xsd:documentation>
  </xsd:annotation>

  <xsd:complexType>
    <xsd:element name="span" minOccurs="1" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:attribute name="start" type="xsd:int" use="required" />
	<xsd:attribute name="stop" type="xsd:int" use="required" />
      </xsd:complexType>
    </xsd:element>
  </xsd:complexType>
</xsd:element>

<xsd:element name="detail" type="detail" />
<xsd:complexType name="detail" abstract="true">
  <xsd:annotation>
    <xsd:documentation>
      This is an empty type which can be extended to attach
      any arbitrary data model to an XFF feature.
    </xsd:documentation>
  </xsd:annotation>
</xsd:complexType>

<xsd:element name="featureSet">
  <xsd:complexType>
    <xsd:element ref="feature" minOccurs="0" maxOccurs="unbounded" />
  </xsd:complexType>
</xsd:element>

<xsd:element name="feature" type="feature" />

<xsd:complexType name="feature">
  <xsd:element name="type" type="xsd:string" minOccurs="1" maxOccurs="1" />
  <xsd:element name="source" type="xsd:string" minOccurs="1" maxOccurs="1" />
  <xsd:element ref="location" minOccurs="1" maxOccurs="1" />
  <xsd:element name="details" minOccurs="0" maxOccurs="1">
    <xsd:complexType>
      <xsd:element ref="detail" minOccurs="1" maxOccurs="unbounded" />
    </xsd:complexType>
  </xsd:element>
  <xsd:element ref="featureSet" minOccurs="0" maxOccurs="1" />
</xsd:complexType>

<xsd:complexType name="strandedFeature" base="feature" derivedBy="extension">
  <xsd:attribute name="strand">
    <xsd:simpleType base="xsd:string" derivedBy="restriction">
      <xsd:enumeration value="+" />
      <xsd:enumeration value="-" />
    </xsd:simpleType>
  </xsd:attribute>
</xsd:complexType>

<xsd:element name="strandedFeature" type="strandedFeature" equivClass="feature" />

</xsd:schema>



--BOKacYhQ+x31HxR3
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="xfftest.xml"

<?xml version="1.0" ?>

<featureSet xmlns="http://www.bioxml.org/2000/xff"
            xmlns:eponine="http://www.biojava.org/2000/eponine">
  <strandedFeature strand="+">
    <type>promoter</type>
    <source>Eponine 2.0</source>
    <location>
      <span start="230" stop="543" />
    </location>
    <featureSet>
      <feature>
        <type>binding_site</type>
	<source>Eponine 2.0</source>
	<location>
	  <span start="532" stop="540" />
        </location>
	<details>
	  <eponine:detail>
	    <eponine:name>TATA box</eponine:name>
	    <eponine:transfacID>ID12345</eponine:transfacID>
	    <eponine:score>0.0467</eponine:score>
          </eponine:detail>
        </details>
      </feature>

      <feature>
        <type>binding_site</type>
	<source>Eponine 2.0</source>
	<location>
	  <span start="500" stop="509" />
        </location>
	<details>
	  <eponine:detail>
	    <eponine:name>fos/jun</eponine:name>
	    <eponine:transfacID>ID23456</eponine:transfacID>
	    <eponine:score>0.175</eponine:score>
          </eponine:detail>
        </details>
      </feature>
    </featureSet>
  </strandedFeature>

  <strandedFeature strand="+">
    <type>mRNA</type>
    <source>UltraGenScan</source>
    <location>
      <span start="544" stop="654" />
      <span start="743" stop="865" />
    </location>
  </strandedFeature>
</featureSet>
--BOKacYhQ+x31HxR3--