[Biojava-l] Biojava XML Binding (BJXB)

Schreiber, Mark mark.schreiber@agresearch.co.nz
Mon, 6 May 2002 16:05:11 +1200


Hi -

I would like to propose and formalise a schema for binding Biojava
objects, especially sequence objects, to XML. The current binding of
Biojava objects to other formats such as GFF, GenBank, EMBL, Game and
Agave is inadequate because details are lost in reading and writing
these objects. While it is useful for Biojava to read and write these
formats, the only way to currently capture everything about a Biojava
object is to serialize it as a binary stream. The advantage of
serializing to an XML document is that the XML can be constructed and
edited using a text editor or programmatic processes on a machine
(possibly a legacy system) with no Biojava installation and no
requirement for a JVM. The XML can also be transported via HTTP/SOAP.
The DTD could also be used as a base for anyone who needs a richer
schema that maps well to Biojava.

Why not use JAXB? Two reasons. First, JAXB requires Java at both ends
of the serialization/deserialization procedure. Second, JAXB doesn't
play well with many Biojava objects due to their use of factory
methods, private and protected constructors and singleton Alphabets.
Actually this was all inspired by my inability to get JAXB to work with
Biojava.
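
To illustrate the constructor problem, here is a minimal sketch. It
assumes a generic binder that creates objects reflectively through a
public no-argument constructor; that is an assumption about how such a
tool typically works, not a description of JAXB internals. DemoAlphabet
is a hypothetical stand-in for a singleton obtained from a factory
method, not a Biojava class.

```java
import java.lang.reflect.Constructor;

public class BinderConstructorSketch {

    /** Hypothetical stand-in for a singleton such as a Biojava Alphabet. */
    public static final class DemoAlphabet {
        private static final DemoAlphabet DNA = new DemoAlphabet("DNA");
        private final String name;

        // Private constructor: instances only come from the factory method.
        private DemoAlphabet(String name) {
            this.name = name;
        }

        /** Factory method that always hands back the shared instance. */
        public static DemoAlphabet forName(String name) {
            if (!"DNA".equals(name)) {
                throw new IllegalArgumentException("Unknown alphabet: " + name);
            }
            return DNA;
        }

        public String getName() {
            return name;
        }
    }

    /** True if a reflective binder could call a public no-arg constructor. */
    public static boolean hasPublicNoArgConstructor(Class<?> type) {
        try {
            // getConstructor() only finds public constructors.
            Constructor<?> c = type.getConstructor();
            return c != null;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ArrayList:    "
                + hasPublicNoArgConstructor(java.util.ArrayList.class));
        System.out.println("DemoAlphabet: "
                + hasPublicNoArgConstructor(DemoAlphabet.class));
        // Even if a binder forced its way in, a second instance would
        // break identity comparisons against the shared singleton.
        System.out.println("same instance: "
                + (DemoAlphabet.forName("DNA") == DemoAlphabet.forName("DNA")));
    }
}
```

A binding that records the alphabet by name (as the alphabet attribute
in the demo file does) can go through the factory method on reading and
so sidesteps the problem.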

I have included a demo XML file and a simple DTD. Obviously there is a
lot of room to expand the DTD to include more Biojava concepts;
however, I thought I would start with a typical use with a rather nasty
feature structure. Currently there is no read or write ability, but
StAX looks like an obvious choice. I suspect there might be a need for
a lot of reflection code in the handlers! I am no StAX expert, so if
someone feels particularly inspired in the next 24 hours to knock out a
quick handler, that would be cool.
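
As a starting point for discussion, here is a rough sketch of the
reading side. It deliberately uses the JDK's plain SAX parser rather
than StAX, and it collects values into a hypothetical SeqRecord holder
instead of building Biojava objects, so the class attributes are kept
as strings and no reflection is attempted. The embedded sample is a
cut-down fragment with no DOCTYPE line, so the parser does not go
looking for bjxb.dtd. Because a sequence can sit inside a feature of
another sequence, the handler keeps a stack of open records.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BjxbSaxSketch {

    /** Hypothetical plain holder; a real handler would build Biojava objects. */
    public static final class SeqRecord {
        public String className;
        public String name;
        public String alphabet;
        public final StringBuilder symbols = new StringBuilder();
    }

    static final class Handler extends DefaultHandler {
        final List<SeqRecord> done = new ArrayList<>();
        final Deque<SeqRecord> open = new ArrayDeque<>(); // nested sequences
        boolean inSymbols;

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            if (qName.equals("sequence")) {
                SeqRecord r = new SeqRecord();
                r.className = atts.getValue("class");
                open.push(r);
            } else if (qName.equals("id") && !open.isEmpty()) {
                open.peek().name = atts.getValue("name");
            } else if (qName.equals("symbol_list") && !open.isEmpty()) {
                open.peek().alphabet = atts.getValue("alphabet");
                inSymbols = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (!inSymbols) {
                return;
            }
            // Drop the line breaks and indentation used to lay out symbols.
            for (int i = start; i < start + length; i++) {
                if (!Character.isWhitespace(ch[i])) {
                    open.peek().symbols.append(ch[i]);
                }
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (qName.equals("symbol_list")) {
                inSymbols = false;
            } else if (qName.equals("sequence")) {
                done.add(open.pop()); // inner sequences finish first
            }
        }
    }

    /** Parses a document and returns its sequences in closing order. */
    public static List<SeqRecord> read(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Handler handler = new Handler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.done;
    }

    /** Cut-down sample: one EST with one feature holding a peptide. */
    public static final String SAMPLE =
        "<seq_db class='org.biojava.bio.seq.db.HashSequenceDB'>"
      + "<sequence class='org.biojava.bio.seq.impl.SimpleSequence'>"
      + "<id name='fooase_est'/>"
      + "<symbol_list class='org.biojava.bio.symbol.SimpleSymbolList'"
      + " alphabet='DNA'>\n  accggt\n  atgacc\n</symbol_list>"
      + "<annotation class='org.biojava.bio.SimpleAnnotation'/>"
      + "<feature_holder>"
      + "<feature class='org.biojava.bio.seq.impl.SimpleFeature'"
      + " source='demo' type='demo'>"
      + "<annotation class='org.biojava.bio.SimpleAnnotation'/>"
      + "<location value='[1..6]'/>"
      + "<sequence class='org.biojava.bio.seq.impl.SimpleSequence'>"
      + "<id name='fooase'/>"
      + "<symbol_list class='org.biojava.bio.symbol.SimpleSymbolList'"
      + " alphabet='PROTEIN'>MTRGPI*</symbol_list>"
      + "<annotation class='org.biojava.bio.SimpleAnnotation'/>"
      + "</sequence></feature></feature_holder></sequence></seq_db>";

    public static void main(String[] args) throws Exception {
        for (SeqRecord r : read(SAMPLE)) {
            System.out.println(r.name + " [" + r.alphabet + "] " + r.symbols);
        }
    }
}
```

The records come back in the order their sequence elements close, so a
peptide nested in a feature is listed before the sequence that contains
it. Features, locations and annotation entries are ignored here.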

Comments and Flames welcome.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE seq_db SYSTEM "bjxb.dtd">

<seq_db class="org.biojava.bio.seq.db.HashSequenceDB">
  <sequence class="org.biojava.bio.seq.impl.SimpleSequence">
    <id name="fooase_est" urn="embl:UA000933"/>
    <symbol_list class="org.biojava.bio.symbol.SimpleSymbolList"
                 alphabet="DNA">
      accggtatgaccagaggacccatatagggacaaaccaaaaaaaaagcccacagcgcgttgagacagg
      gggacacacccatatttaagaggacaccaaccccccccaaagagagagatnaaaaanaaana
    </symbol_list>
    <annotation class="org.biojava.bio.SimpleAnnotation">
      <entry key="organism" value="Homo Sapiens"/>
      <entry key="seq_type" value="EST"/>
      <entry key="date" value="19/11/2001"/>
    </annotation>
    <feature_holder>
      <feature class="org.biojava.bio.seq.genomic.TranslatedRegion"
               source="auto translation"
               type="predicted peptide">
        <annotation class="org.biojava.bio.Annotation.EmptyAnnotation"/>
        <location value="[7..28]"/>
        <sequence class="org.biojava.bio.seq.impl.SimpleSequence">
          <id name="fooase"/>
          <symbol_list class="org.biojava.bio.symbol.SimpleSymbolList"
                       alphabet="PROTEIN">
            MTRGPI*
          </symbol_list>
          <annotation class="org.biojava.bio.Annotation.EmptyAnnotation"/>
        </sequence>
        <feature class="org.biojava.bio.seq.impl.SimpleFeature"
                 source="experimental evidence"
                 type="SNP">
          <annotation class="org.biojava.bio.SimpleAnnotation">
            <entry key="SNP_type" value="g:c"/>
          </annotation>
          <location value="14"/>
        </feature>
      </feature>
      <feature class="org.biojava.bio.seq.impl.SimpleFeature"
               source="experimental"
               type="PolyA tail">
        <annotation class="org.biojava.bio.Annotation.EmptyAnnotation"/>
        <location value="[119..131]"/>
      </feature>
    </feature_holder>
  </sequence>
</seq_db>

<?xml version="1.0" encoding="UTF-8" ?>
<!ELEMENT id EMPTY >
<!ATTLIST id urn NMTOKEN #IMPLIED >
<!ATTLIST id name NMTOKEN #REQUIRED >

<!ELEMENT feature_holder ( feature* ) >

<!ELEMENT annotation ( entry* ) >
<!ATTLIST annotation class NMTOKEN #REQUIRED >

<!ELEMENT sequence ( id, symbol_list, annotation, feature_holder? ) >
<!ATTLIST sequence class NMTOKEN #REQUIRED >

<!ELEMENT seq_db ( sequence+ ) >
<!ATTLIST seq_db class NMTOKEN #REQUIRED >

<!ELEMENT symbol_list ( #PCDATA ) >
<!ATTLIST symbol_list class NMTOKEN #REQUIRED >
<!ATTLIST symbol_list alphabet NMTOKEN #REQUIRED >

<!ELEMENT location EMPTY >
<!ATTLIST location value CDATA #REQUIRED >

<!ELEMENT entry EMPTY >
<!ATTLIST entry key NMTOKEN #REQUIRED >
<!ATTLIST entry value CDATA #REQUIRED >

<!ELEMENT feature ( annotation, location, sequence?, feature? ) >
<!ATTLIST feature type CDATA #REQUIRED >
<!ATTLIST feature source CDATA #REQUIRED >
<!ATTLIST feature class NMTOKEN #REQUIRED >
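
One way to check a document against this DTD without any Biojava code
is the JDK's validating DOM parser. The sketch below is only an
illustration: the DTD text is embedded as a string and handed to the
parser through an EntityResolver so that no bjxb.dtd file is needed on
disk, and the class and method names (BjxbValidateSketch, sample,
validate) are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class BjxbValidateSketch {

    // The DTD from this message, with ATTLISTs merged per element.
    static final String DTD =
        "<!ELEMENT seq_db ( sequence+ )>"
      + "<!ATTLIST seq_db class NMTOKEN #REQUIRED>"
      + "<!ELEMENT sequence ( id, symbol_list, annotation, feature_holder? )>"
      + "<!ATTLIST sequence class NMTOKEN #REQUIRED>"
      + "<!ELEMENT id EMPTY>"
      + "<!ATTLIST id urn NMTOKEN #IMPLIED name NMTOKEN #REQUIRED>"
      + "<!ELEMENT symbol_list ( #PCDATA )>"
      + "<!ATTLIST symbol_list class NMTOKEN #REQUIRED"
      + " alphabet NMTOKEN #REQUIRED>"
      + "<!ELEMENT annotation ( entry* )>"
      + "<!ATTLIST annotation class NMTOKEN #REQUIRED>"
      + "<!ELEMENT entry EMPTY>"
      + "<!ATTLIST entry key NMTOKEN #REQUIRED value CDATA #REQUIRED>"
      + "<!ELEMENT feature_holder ( feature* )>"
      + "<!ELEMENT feature ( annotation, location, sequence?, feature? )>"
      + "<!ATTLIST feature type CDATA #REQUIRED source CDATA #REQUIRED"
      + " class NMTOKEN #REQUIRED>"
      + "<!ELEMENT location EMPTY>"
      + "<!ATTLIST location value CDATA #REQUIRED>";

    /** Small test document; the annotation can be left out to break it. */
    public static String sample(boolean withAnnotation) {
        return "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE seq_db SYSTEM \"bjxb.dtd\">\n"
            + "<seq_db class=\"org.biojava.bio.seq.db.HashSequenceDB\">\n"
            + "  <sequence class=\"org.biojava.bio.seq.impl.SimpleSequence\">\n"
            + "    <id name=\"fooase_est\" urn=\"embl:UA000933\"/>\n"
            + "    <symbol_list"
            + " class=\"org.biojava.bio.symbol.SimpleSymbolList\""
            + " alphabet=\"DNA\">accggt</symbol_list>\n"
            + (withAnnotation
                ? "    <annotation class=\"org.biojava.bio.SimpleAnnotation\"/>\n"
                : "")
            + "  </sequence>\n"
            + "</seq_db>\n";
    }

    /** Returns the validity errors reported for the document (empty if valid). */
    public static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Serve the embedded DTD whenever the parser asks for bjxb.dtd.
        builder.setEntityResolver((publicId, systemId) ->
            systemId != null && systemId.endsWith("bjxb.dtd")
                ? new InputSource(new StringReader(DTD))
                : null);

        final List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) {
                errors.add(e.getMessage()); // validity errors land here
            }
            @Override public void fatalError(SAXParseException e)
                    throws SAXParseException {
                throw e; // not well-formed: give up
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("complete:   " + validate(sample(true)));
        System.out.println("incomplete: " + validate(sample(false)));
    }
}
```

Validity problems such as a sequence without its annotation are
collected rather than thrown, so a caller can report all of them at
once; a document that is not well-formed still aborts the parse.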


Mark Schreiber
Bioinformatics
AgResearch Invermay
PO Box 50034
Mosgiel
New Zealand
 
PH:   +64 3 489 9175
FAX:  +64 3 489 3739

