[Open-bio-l] SeqXML an alternative for FASTA

Tue Jul 5 15:35:45 UTC 2011

On Jul 5, 2011, at 9:57 AM, Thomas Schmitt wrote:

> Hi,
> 
> Thanks for the feedback!
> 
> On Jul 4, 2011, at 1:41 PM, Peter Cock wrote:
> 
>> On Fri, Jul 1, 2011 at 8:57 AM, Thomas Schmitt <Thomas.Schmitt at sbc.su.se> wrote:
>>> Hello everybody,
>>> 
>>> We recently published a new XML format called SeqXML to store biological
>>> sequences. Our aim was to create a lightweight alternative to FASTA that
>>> stores the metadata typically squeezed into a FASTA header
>>> in a standardized way.
>>> 
>>> It looks something like this:
>>> 
>>> <seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
>>>   <entry id="ENST00000308775" >
>>>       <description>dystroglycan 1</description>
>>>       <RNAseq>AAGGCGAUGUC.....ACAU</RNAseq>
>>>       <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
>>>       <property name="prediction_method" value="manual curation"/>
>>>   </entry>
>>>   <entry id="ENSP00000312435" >
>>>       <AAseq>AAGGCGAAA...CACJOXA</AAseq>
>>>   </entry>
>>> </seqXML>
>>> 
>>> Check out the paper at http://bib.oxfordjournals.org/content/early/2011/06/10/bib.bbr025.full?keytype=ref&ijkey=dWzLPFBuzrdZme8
>>> 
>>> There is also a website (http://seqxml.org) where you can find the schema and
>>> detailed documentation. The whole thing emerged from developing formats for the
>>> orthology community, so you will also find information about our orthology format
>>> OrthoXML at these resources.
>>> 
>>> 
>>> To my knowledge the only format comparable to SeqXML is TinySeq, which
>>> has some significant limitations:
>>> 
>>> - It doesn't support database cross referencing
>>> - The identifiers are more NCBI specific
>>> - It is more verbose
>>> - There is only a very primitive DTD
>>> - It doesn't allow validating the sequence alphabet
>>> - It isn't possible to define the source of the sequences
>>> - It doesn't support key value pair annotations
>>> 
>> 
>> Thanks for the comparison to TinySeq. Did you find a good introductory
>> document for this file format?
> 
> Not really, the only thing I found was the DTD, a very general document, and some examples.
> 
> http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.mod.dtd
> http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/ncbixml.txt
> http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&qty=3&c_start=1&list_uids=D12625,D42072,M82814&uids=&dopt=tinyseq&dispmax=5&sendto=

TinySeq has been around for a while, but it is mainly a very small subset of the ASN.1 output converted to XML via asn2xml.  I have noticed that XML output appears to be limited to eutils now; requesting the data in XML via Entrez is no longer possible.  I can kind of understand that, but then they still offer ASN.1.

>>> We are trying to get IO implementations for SeqXML for all Bio* projects.
>>> 
>> 
>> That would definitely help with getting people using the format.
>> 
>>> 
>>> There is already an implementation in BioPerl maintained by Dave Messina.
>>> We do have an implementation for the legacy version of BioJava and Andrew
>>> Yates promised to help us migrate it to BioJava 3.
>> 
>> That sounds promising.

It's a fairly simple format to support.
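
As a rough illustration of how little a reader needs, here is a minimal sketch using only Python's standard library. It is not the BioPerl, BioJava, or Biopython implementation. The element and attribute names are taken from the example quoted above rather than from the schema, the sequences are shortened placeholders, and `read_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sequence element names as they appear in the example quoted above.
SEQ_TAGS = ("DNAseq", "RNAseq", "AAseq")

# Shortened placeholder sequences, for illustration only.
EXAMPLE = """<seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
  <entry id="ENST00000308775">
    <description>dystroglycan 1</description>
    <RNAseq>AAGGCGAUGUCACAU</RNAseq>
  </entry>
  <entry id="ENSP00000312435">
    <AAseq>AAGGCGAAACAC</AAseq>
  </entry>
</seqXML>"""


def read_entries(xml_text):
    """Return one dict per <entry> with its id, description, type and sequence."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("entry"):
        record = {
            "id": entry.get("id"),
            "description": entry.findtext("description"),  # None if absent
            "type": None,
            "sequence": None,
        }
        # Take the first sequence element found in the entry.
        for tag in SEQ_TAGS:
            text = entry.findtext(tag)
            if text is not None:
                record["type"] = tag
                record["sequence"] = text.strip()
                break
        records.append(record)
    return records


records = read_entries(EXAMPLE)
for rec in records:
    print(rec["id"], rec["type"], rec["sequence"])
```

A real parser would also validate against the schema and stream large files instead of loading them whole.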

>>> I'm also in contact with Peter Cock about a Biopython integration. He in
>>> fact asked me to move the discussion to this list.
>> 
>> :)
>> 
>> Note we're using the format name "seqxml" in Biopython's SeqIO to match
>> what was used in BioPerl's SeqIO.
>> 
>>> 
>>> What do you guys think about the format?
>>> 
>> 
>> I'm wondering about the predefined allowed character sets for DNA, RNA
>> and Protein, and if they are overly prescriptive for some special use-cases.
>> Extra symbols are sometimes included for things like frame shifts, or to
>> indicate different stop codons.
>> 
>> Related to this, what about things like modified RNA (a vast alphabet),
>> or color space (used in the ABI SOLiD sequencing platform)?
>> The simple answer is these are out of scope ;)
> 
> Right now SeqXML supports 3 different alphabets. These cover the basic use-cases and shouldn't be changed.
> But one can easily add more alphabets for special purposes in the form of different sequence types.
> Apart from those mentioned above, quality values and RNA secondary structures come to mind.
> Because the sequence type is not defined at the entry level, adding new types is backwards compatible.
> With these different sequence types, one might also want to allow more than one sequence per entry.
> I do however think we should be careful with adding new features. We don't want to cover every possible use-case
> and end up with a format monster. Our goal was to create a simple format that fulfills the typical needs met by FASTA.
> The question that remains to be solved is what is typical.
> Another issue that I see is API support. Do all Bio* APIs support such special alphabets?

I think in general the three main ones (DNA, RNA, Protein) are supported in all the Bio* projects, but as Peter indicates the 'alphabet' could probably be genericized to allow other alphabets for validation; otherwise the scope of the format has to be limited to specific alphabets.
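
To make the "genericized alphabet" idea concrete, here is a minimal sketch of validation driven by a lookup table instead of hard-coded types. The symbol sets below are IUPAC-style codes chosen for illustration and are not copied from the SeqXML schema. `invalid_symbols` and the `"colorspace"` key are hypothetical names.

```python
# Illustrative alphabets only (IUPAC-style codes), NOT taken from the SeqXML schema.
ALPHABETS = {
    "DNAseq": set("ACGTRYSWKMBDHVN"),
    "RNAseq": set("ACGURYSWKMBDHVN"),
    "AAseq": set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
}


def invalid_symbols(seq_type, sequence, alphabets=ALPHABETS):
    """Return the sorted symbols in `sequence` that the alphabet does not allow."""
    if seq_type not in alphabets:
        raise ValueError(f"no alphabet registered for {seq_type!r}")
    return sorted(set(sequence) - alphabets[seq_type])


# The built-in types are checked against their own tables.
rna_problems = invalid_symbols("RNAseq", "AAGGCGAUGUC")
dna_problems = invalid_symbols("DNAseq", "ACGU")

# A special-purpose alphabet is just one more table entry (hypothetical example).
extended = dict(ALPHABETS, colorspace=set("0123."))
color_problems = invalid_symbols("colorspace", "01230", extended)

print(rna_problems, dna_problems, color_problems)
```

With this design, adding a new sequence type only adds a table entry. Whether such extra types belong in the format at all is the scope question discussed above.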

A bit of history: IIRC on the BioPerl end there was some movement in this direction quite a while ago (Bio::Symbol I believe), but it was never followed through and the code is deprecated.  I think this would be feasible but isn't really a high priority.

>> However, the main missing feature for me is a feature table as in the
>> GenBank, GenPept, EMBL, SwissProt etc flat files, and also represented
>> in some way in their XML equivalents:
>> 
>> http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.mod.dtd
>> http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.mod.dtd
>> (I haven't found the details of the feature tables/sets yet)
>> 
>> http://www.uniprot.org/docs/uniprot.xsd
>> http://www.uniprot.org/docs/xml_news.htm
>> (Biopython already has a parser for the UniProt XML format, including
>> the features.)
>> 
>> Clearly there is overlap here with GFF3 as well - so this is a potential
>> minefield of compatibility issues. Again, the simple answer is features
>> are out of scope.
> 
> SeqXML supports simple features in the form of key-value pairs. Rich position-specific feature tables
> are something for full-blown record formats like the ones you mentioned, which we are clearly not trying to create.
> So in short I would say out of scope.

Makes sense for a simple format.
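
For comparison with a full feature table, the key-value annotations in the quoted example can be read in a few lines. This is a minimal sketch using Python's standard library. The `property` and `DBRef` element names and their attributes come from the example quoted earlier in this thread, `annotations` is a hypothetical helper, and the entry below is a fragment with the sequence element left out for brevity.

```python
import xml.etree.ElementTree as ET

# Fragment of the quoted example; the sequence element is omitted for brevity.
ENTRY = """<entry id="ENST00000308775">
  <description>dystroglycan 1</description>
  <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
  <property name="prediction_method" value="manual curation"/>
</entry>"""


def annotations(entry):
    """Return (properties, cross-references) for one <entry> element."""
    # Pairs are kept as a list, so a repeated property name is not lost.
    props = [(p.get("name"), p.get("value")) for p in entry.findall("property")]
    xrefs = [(r.get("source"), r.get("id")) for r in entry.findall("DBRef")]
    return props, xrefs


props, xrefs = annotations(ET.fromstring(ENTRY))
print(props)
print(xrefs)
```

None of these annotations carry coordinates, which is the difference from the position-specific feature tables in GenBank or UniProt records.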

>>> Is there anybody who wants to contribute with a BioRuby implementation?
>>> 
>>> Best regards,
>>> Thomas
>> 
>> I've also CC'd Peter Rice to ask if SeqXML is something EMBOSS would
>> consider supporting?
>> 
>> Regards,
>> 
>> Peter
> 
> Cheers,
> Thomas

chris