[Biopython-dev] Fwd: SeqXML an alternative for FASTA

Tue Jul 5 16:10:14 UTC 2011

Hi all,

I've been in touch with Thomas Schmitt about merging read/write
support for the SeqXML file format (see below and http://seqxml.org/ )
into Biopython's SeqIO module.

BioPerl already supports this (under format name "seqxml") and a
BioJava v3 implementation is in progress. We're discussing this and
the format itself on the cross-project OBF mailing list (see below):

http://lists.open-bio.org/pipermail/open-bio-l/2011-July/000805.html

Please feel free to join that list if you want to discuss anything
general, or comment here on the Biopython implementation.
I've got a branch on GitHub which seems nearly ready for merging,
https://github.com/peterjc/biopython/commits/seqxml2
(a rebase of https://github.com/peterjc/biopython/commits/seqxml).

Regards,

Peter

---------- Forwarded message ----------
From: Thomas Schmitt <Thomas.Schmitt at sbc.su.se>
Date: Fri, Jul 1, 2011 at 8:57 AM
Subject: [Open-bio-l] SeqXML an alternative for FASTA
To: open-bio-l at lists.open-bio.org

Hello everybody,

We recently published a new XML format called SeqXML to store
biological sequences. Our aim was to create a lightweight alternative
to FASTA that allows storing the metadata that is typically squeezed
into a FASTA header in a standardized way.

It looks something like this:

<seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
   <entry id="ENST00000308775" >
       <description>dystroglycan 1</description>
       <RNAseq>AAGGCGAUGUC.....ACAU</RNAseq>
       <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
       <property name="prediction_method" value="manual curation"/>
   </entry>
   <entry id="ENSP00000312435" >
       <AAseq>AAGGCGAAA...CACJOXA</AAseq>
   </entry>
</seqXML>
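
To give a feel for how such a file could be read, here is a minimal
Python sketch using only the standard library's xml.etree.ElementTree.
It is not the Biopython, BioPerl or BioJava implementation, just an
illustration. The helper name read_entries and the dictionary layout
are made up for this example, the sequence strings are shortened
placeholders rather than real data, and the DNAseq element name is
assumed from the published schema since it does not appear in the
snippet above.

```python
import xml.etree.ElementTree as ET

# Small well-formed sample modelled on the snippet above.
# The sequences are shortened placeholders, not real data.
SAMPLE = """\
<seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
  <entry id="ENST00000308775">
    <description>dystroglycan 1</description>
    <RNAseq>AAGGCGAUGUC</RNAseq>
    <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
    <property name="prediction_method" value="manual curation"/>
  </entry>
  <entry id="ENSP00000312435">
    <AAseq>AAGGCGAAA</AAseq>
  </entry>
</seqXML>
"""

# Elements that carry the sequence, mapped to a molecule type label.
# "DNAseq" is an assumption based on the schema; it is not shown above.
SEQ_TAGS = {"DNAseq": "DNA", "RNAseq": "RNA", "AAseq": "protein"}


def read_entries(xml_text):
    """Return one dict per <entry> element found in a SeqXML string."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter("entry"):
        record = {
            "id": entry.get("id"),
            # File-level attributes on the root apply to every entry here.
            "species": root.get("speciesName"),
            "source": root.get("source"),
            "description": entry.findtext("description"),
            "molecule": None,
            "sequence": None,
            "dbrefs": [(ref.get("source"), ref.get("id"))
                       for ref in entry.findall("DBRef")],
            "properties": {prop.get("name"): prop.get("value")
                           for prop in entry.findall("property")},
        }
        # The element name itself tells us which alphabet the sequence uses.
        for child in entry:
            if child.tag in SEQ_TAGS:
                record["molecule"] = SEQ_TAGS[child.tag]
                record["sequence"] = (child.text or "").strip()
        records.append(record)
    return records


if __name__ == "__main__":
    for rec in read_entries(SAMPLE):
        print(rec["id"], rec["molecule"], rec["sequence"], rec["properties"])
```

Because the molecule type is encoded in the element name and the
metadata lives in attributes, no header-line parsing is needed, which
is the main practical difference from FASTA.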

Check out the paper at
http://bib.oxfordjournals.org/content/early/2011/06/10/bib.bbr025.full?keytype=ref&ijkey=dWzLPFBuzrdZme8
There is also a website (http://seqxml.org) where you can find the
schema and detailed documentation. The whole thing emerged from
developing formats for the orthology community so you will also find
information about our orthology format OrthoXML at these resources.

To my knowledge the only format comparable to SeqXML is TinySeq, which
has some significant limitations:

- It doesn't support database cross referencing
- The identifiers are more NCBI specific
- It is more verbose
- There is only a very primitive DTD
- It doesn't allow validation of the sequence alphabet
- It isn't possible to define the source of the sequences
- It doesn't support key value pair annotations

We are trying to get IO implementations for SeqXML for all Bio* projects.
There is already an implementation in BioPerl maintained by Dave
Messina. We have an implementation for the legacy version of
BioJava, and Andrew Yates has promised to help us migrate it to
BioJava 3.
I'm also in contact with Peter Cock about a Biopython integration. He
in fact asked me to move the discussion to this list.

What do you guys think about the format?
Is there anybody who wants to contribute a BioRuby implementation?

Best regards,
Thomas

_______________________________________________
Open-Bio-l mailing list
Open-Bio-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/open-bio-l