[Open-bio-l] SeqXML an alternative for FASTA

Thomas Schmitt Thomas.Schmitt at sbc.su.se
Fri Jul 1 07:57:05 UTC 2011


Hello everybody,

We recently published a new XML format called SeqXML to store biological sequences. Our aim was to create a lightweight alternative to FASTA that stores, in a standardized way, the metadata that is typically squeezed into a FASTA header.

It looks something like this:

<seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
    <entry id="ENST00000308775" >
        <description>dystroglycan 1</description>
        <RNAseq>AAGGCGAUGUC.....ACAU</RNAseq>
        <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
        <property name="prediction_method" value="manual curation"/>
    </entry>
    <entry id="ENSP00000312435" >
        <AAseq>AAGGCGAAA...CACJOXA</AAseq>
    </entry>
</seqXML>
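
To give an idea of how little code is needed to read such a file, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is not one of the Bio* implementations mentioned below. The function names read_entries and find_sequence are made up for this illustration, the sequences are shortened placeholders, and the DNAseq element name is assumed by analogy with RNAseq and AAseq.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the example above.
# The sequences are placeholders, not real data.
SEQXML = """
<seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
    <entry id="ENST00000308775">
        <description>dystroglycan 1</description>
        <RNAseq>AAGGCGAUGUC</RNAseq>
        <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
        <property name="prediction_method" value="manual curation"/>
    </entry>
    <entry id="ENSP00000312435">
        <AAseq>MKVLAT</AAseq>
    </entry>
</seqXML>
"""

# Elements that carry the sequence. RNAseq and AAseq appear in the
# example; DNAseq is an assumption made by analogy.
SEQ_TAGS = ("DNAseq", "RNAseq", "AAseq")


def find_sequence(entry):
    """Return (tag, sequence) for the first sequence element found."""
    for tag in SEQ_TAGS:
        elem = entry.find(tag)
        if elem is not None:
            return tag, elem.text
    return None, None


def read_entries(text):
    """Parse a SeqXML string into a list of plain dictionaries."""
    root = ET.fromstring(text)
    entries = []
    for entry in root.findall("entry"):
        seq_type, sequence = find_sequence(entry)
        entries.append({
            "id": entry.get("id"),
            # File-level metadata is copied onto every entry.
            "species": root.get("speciesName"),
            "source": root.get("source"),
            # findtext returns None when there is no description.
            "description": entry.findtext("description"),
            "seq_type": seq_type,
            "sequence": sequence,
            "dbrefs": [(ref.get("source"), ref.get("id"))
                       for ref in entry.findall("DBRef")],
            "properties": {prop.get("name"): prop.get("value")
                           for prop in entry.findall("property")},
        })
    return entries


if __name__ == "__main__":
    for record in read_entries(SEQXML):
        print(record["id"], record["seq_type"], record["properties"])
```

Each piece of metadata comes back under its own key, with no header string to split or guess at.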

Check out the paper at http://bib.oxfordjournals.org/content/early/2011/06/10/bib.bbr025.full?keytype=ref&ijkey=dWzLPFBuzrdZme8
There is also a website (http://seqxml.org) where you can find the schema and detailed documentation. The format emerged from our work on formats for the orthology community, so you will also find information about our orthology format, OrthoXML, at these resources.


To my knowledge, the only format comparable to SeqXML is TinySeq, which has some significant limitations:

- It doesn't support database cross referencing
- The identifiers are more NCBI-specific
- It is more verbose
- There is only a very primitive DTD
- It doesn't allow validation of the sequence alphabet
- It isn't possible to define the source of the sequences
- It doesn't support key-value pair annotations


We are trying to get IO implementations for SeqXML for all Bio* projects.
There is already an implementation in BioPerl, maintained by Dave Messina. We have an implementation for the legacy version of BioJava, and Andrew Yates has promised to help us migrate it to BioJava 3.
I'm also in contact with Peter Cock about a Biopython integration. He in fact asked me to move the discussion to this list.

What do you guys think about the format?
Is there anybody who wants to contribute a BioRuby implementation?


Best regards,
Thomas



