[Open-bio-l] SeqXML an alternative for FASTA

Tue Jul 5 15:57:51 UTC 2011

Hi,

From a BioJava 3 (BJ3) perspective we should be able to push this into the API with minimum fuss. We already support the different alphabets, including the redundancy codes (not too sure about gaps off the top of my head, but something could be done). I can't see this taking too long :)
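
A minimal sketch of what reading the format and checking alphabets could look like, in Python rather than BioJava and using only the standard library. The document is modelled on the example quoted below, with the elided sequence parts simply dropped and the root element closed properly. The alphabets in ALPHABETS are an assumption (IUPAC-style codes including the redundancy codes, gaps treated as invalid); the authoritative definitions are in the schema at seqxml.org. The helper names read_entries and invalid_symbols are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy document modelled on the example in this thread (sequences shortened).
SEQXML = """<seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
  <entry id="ENST00000308775">
    <description>dystroglycan 1</description>
    <RNAseq>AAGGCGAUGUCACAU</RNAseq>
    <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
    <property name="prediction_method" value="manual curation"/>
  </entry>
  <entry id="ENSP00000312435">
    <AAseq>AAGGCGAAACACJOXA</AAseq>
  </entry>
</seqXML>"""

# ASSUMPTION: allowed symbols per sequence element; check the real schema.
ALPHABETS = {
    "DNAseq": set("ACGTRYSWKMBDHVN"),
    "RNAseq": set("ACGURYSWKMBDHVN"),
    "AAseq": set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
}


def read_entries(xml_text):
    """Return (entry id, sequence element name, sequence) for each entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("entry"):
        for child in entry:
            # The sequence type is carried by the element name, not the entry.
            if child.tag in ALPHABETS:
                seq = (child.text or "").strip()
                entries.append((entry.get("id"), child.tag, seq))
    return entries


def invalid_symbols(seq_type, seq):
    """Return the sorted symbols in seq that are outside the assumed alphabet."""
    return sorted(set(seq) - ALPHABETS[seq_type])


if __name__ == "__main__":
    for entry_id, seq_type, seq in read_entries(SEQXML):
        print(entry_id, seq_type, invalid_symbols(seq_type, seq) or "ok")
```

Because the alphabet is looked up by element name, supporting a further sequence type in this sketch would only mean adding another key to ALPHABETS.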

Andy

On 5 Jul 2011, at 16:35, Chris Fields wrote:

> On Jul 5, 2011, at 9:57 AM, Thomas Schmitt wrote:
> 
>> Hi,
>> 
>> Thanks for the feedback!
>> 
>> On Jul 4, 2011, at 1:41 PM, Peter Cock wrote:
>> 
>>> On Fri, Jul 1, 2011 at 8:57 AM, Thomas Schmitt <Thomas.Schmitt at sbc.su.se> wrote:
>>>> Hello everybody,
>>>> 
>>>> We recently published a new XML format called SeqXML to store biological
>>>> sequences. Our aim was to create a lightweight alternative to FASTA that
>>>> allows storing the metadata that is typically squeezed into a FASTA header
>>>> in a standardized way.
>>>> 
>>>> It looks something like this:
>>>> 
>>>> <seqXML speciesName="Homo sapiens" ncbiTaxID="9606" source="Ensembl">
>>>>  <entry id="ENST00000308775" >
>>>>      <description>dystroglycan 1</description>
>>>>      <RNAseq>AAGGCGAUGUC.....ACAU</RNAseq>
>>>>      <DBRef type="DNA" source="RefSeq" id="NM_004393"/>
>>>>      <property name="prediction_method" value="manual curation"/>
>>>>  </entry>
>>>>  <entry id="ENSP00000312435" >
>>>>      <AAseq>AAGGCGAAA...CACJOXA</AAseq>
>>>>  </entry>
>>>> </seqXML>
>>>> 
>>>> Check out the paper at http://bib.oxfordjournals.org/content/early/2011/06/10/bib.bbr025.full?keytype=ref&ijkey=dWzLPFBuzrdZme8
>>>> 
>>>> There is also a website (http://seqxml.org) where you can find the schema and
>>>> detailed documentation. The whole thing emerged from developing formats for the
>>>> orthology community so you will also find information about our orthology format
>>>> OrthoXML at these resources.
>>>> 
>>>> 
>>>> To my knowledge the only format comparable to SeqXML is TinySeq, which does
>>>> have some significant limitations:
>>>> 
>>>> - It doesn't support database cross referencing
>>>> - The identifiers are more NCBI specific
>>>> - It is more verbose
>>>> - There is only a very primitive DTD
>>>> - It doesn't allow validating the sequence alphabet
>>>> - It isn't possible to define the source of the sequences
>>>> - It doesn't support key value pair annotations
>>>> 
>>> 
>>> Thanks for the comparison to TinySeq. Did you find a good introductory
>>> document for this file format?
>> 
>> Not really, the only thing I found was the DTD, a very general document, and some examples.
>> 
>> http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.mod.dtd
>> http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/ncbixml.txt
>> http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&qty=3&c_start=1&list_uids=D12625,D42072,M82814&uids=&dopt=tinyseq&dispmax=5&sendto=
> 
> TinySeq has been around for a while, but it is mainly a very small subset of the ASN.1 output converted to XML via asn2xml.  I have noticed that XML output appears to be limited to eutils now; requests for the data in XML via Entrez are no longer available.  I can kind of understand that, but then they still offer ASN.1.
> 
>>>> We are trying to get IO implementations for SeqXML for all Bio* projects.
>>>> 
>>> 
>>> That would definitely help with getting people using the format.
>>> 
>>>> 
>>>> There is already an implementation in BioPerl maintained by Dave Messina.
>>>> We do have an implementation for the legacy version of BioJava and Andrew
>>>> Yates promised to help us migrate it into BioJava 3.
>>> 
>>> That sounds promising.
> 
> It's a fairly simple format to support.
> 
>>>> I'm also in contact with Peter Cock about a Biopython integration. He in
>>>> fact asked me to move the discussion to this list.
>>> 
>>> :)
>>> 
>>> Note we're using the format name "seqxml" in Biopython's SeqIO to match
>>> what was used in BioPerl's SeqIO.
>>> 
>>>> 
>>>> What do you guys think about the format?
>>>> 
>>> 
>>> I'm wondering about the predefined allowed character sets for DNA, RNA
>>> and Protein, and if they are overly prescriptive for some special use-cases.
>>> Extra symbols are sometimes included for things like frame shifts, or to
>>> indicate different stop codons.
>>> 
>>> Related to this, what about things like modified RNA (a vast alphabet),
>>> or color space (used in the ABI SOLiD sequencing platform)?
>>> The simple answer is these are out of scope ;)
>> 
>> Right now SeqXML supports 3 different alphabets. These cover the basic use-cases and shouldn't be changed.
>> But one can easily add more alphabets for special purposes in the form of different sequence types.
>> What comes to mind, apart from those mentioned above, are quality values and RNA secondary structures.
>> Because the sequence type is not defined at the entry level adding new types is backwards compatible. 
>> Having these different sequences one might also want to allow more than one sequence per entry.
>> I do however think we should be careful with adding new features. We don't want to cover every possible use-case 
>> and end up with a format monster. Our goal was to create a simple format that fulfills the typical needs for FASTA.
>> The question that remains to be solved is what is typical.
>> Another issue that I see is API support. Do all Bio* APIs support such special alphabets?
> 
> I think in general the three main ones (DNA, RNA, Protein) are supported in all the Bio* projects, but as Peter indicates the 'alphabet' could probably be generalized to allow other alphabets for validation, or the scope of the format has to be limited to specific alphabets.
> 
> A bit of history: IIRC on the BioPerl end there was some movement in this direction quite a while ago (Bio::Symbol I believe), but it was never followed through and the code is deprecated.  I think this could be feasibly done but isn't really a high priority.
> 
>>> However, the main missing feature for me is a feature table as in the
>>> GenBank, GenPept, EMBL, SwissProt etc flat files, and also represented
>>> in some way in their XML equivalents:
>>> 
>>> http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.mod.dtd
>>> http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.mod.dtd
>>> (I haven't found the details of the feature tables/sets yet)
>>> 
>>> http://www.uniprot.org/docs/uniprot.xsd
>>> http://www.uniprot.org/docs/xml_news.htm
>>> (Biopython already has a parser for the UniProt XML format, including
>>> the features.)
>>> 
>>> Clearly there is overlap here with GFF3 as well - so this is a potential
>>> mine field of compatibility issues. Again, the simple answer is features
>>> are out of scope.
>> 
>> SeqXML supports simple features in the form of key-value pairs. Rich position-specific feature tables
>> are something for full-blown record formats like the ones you mentioned, which we are clearly not trying to create.
>> So in short I would say out of scope.
> 
> Makes sense for a simple format.
> 
>>>> Is there anybody who wants to contribute with a BioRuby implementation?
>>>> 
>>>> Best regards,
>>>> Thomas
>>> 
>>> I've also CC'd Peter Rice to ask if SeqXML is something EMBOSS would
>>> consider supporting?
>>> 
>>> Regards,
>>> 
>>> Peter
>> 
>> Cheers,
>> Thomas
> 
> 
> chris

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/