[Open-bio-l] SeqXML an alternative for FASTA

Tue Jul 5 15:23:14 UTC 2011

On Tue, Jul 5, 2011 at 3:57 PM, Thomas Schmitt <Thomas.Schmitt at sbc.su.se> wrote:
> On Jul 4, 2011, at 1:41 PM, Peter Cock wrote:
>>
>> I'm wondering about the predefined allowed character sets for DNA, RNA
>> and Protein, and if they are overly prescriptive for some special use-cases.
>> Extra symbols are sometimes included for things like frame shifts, or to
>> indicate different stop codons.
>>
>> Related to this, what about things like modified RNA (a vast alphabet),
>> or color space (used in the ABI SOLiD sequencing platform)?
>> The simple answer is these are out of scope ;)
>
> Right now SeqXML supports 3 different alphabets. These cover the
> basic use-cases and shouldn't be changed. But one can easily add
> more alphabets for special purposes in the form of different sequence
> types.

OK, if there is a common need. For now I'd keep it simple.
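
As a rough illustration of what "adding an alphabet for a special purpose"
could mean in practice, here is a minimal sketch of checking a sequence
against a per-type character set, with optional extra symbols (e.g. "*" for
a stop). The letter sets below are the usual IUPAC nucleotide codes and are
only an assumption for the example; the SeqXML schema defines its own allowed
characters, and the names ALPHABETS and invalid_letters are made up here.

```python
# Hypothetical alphabets for illustration only (IUPAC nucleotide codes);
# the SeqXML schema is the authority on what it actually allows.
ALPHABETS = {
    "DNA": set("ACGTRYSWKMBDHVN"),
    "RNA": set("ACGURYSWKMBDHVN"),
}


def invalid_letters(seq, seq_type, extra=""):
    """Return the sorted letters in seq that the chosen alphabet rejects.

    extra lists additional symbols a special use-case wants to permit.
    """
    allowed = ALPHABETS[seq_type] | set(extra)
    return sorted(set(seq.upper()) - allowed)


print(invalid_letters("ACG*T", "DNA"))             # the stop symbol is rejected
print(invalid_letters("ACG*T", "DNA", extra="*"))  # accepted once added
```

The point of the extra argument is that a special use-case can widen the
check without touching the basic alphabets.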

> What comes to mind, apart from those mentioned above, are quality
> values and RNA secondary structures.

And also protein secondary structures (e.g. alpha/beta/coil). These
fit the per-letter-annotation idea we have in Biopython.
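
To make the per-letter-annotation idea concrete, here is a minimal
plain-Python sketch (not Biopython's actual implementation): each annotation
is a string or list with exactly one value per sequence letter, and slicing
the record slices the annotations with it. The class name AnnotatedSeq and
its methods are hypothetical, chosen just for this example.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedSeq:
    """A sequence plus annotations holding one value per letter."""

    seq: str
    letter_annotations: dict = field(default_factory=dict)

    def annotate(self, key, values):
        # Every per-letter annotation must match the sequence length.
        if len(values) != len(self.seq):
            raise ValueError(
                f"{key!r} has {len(values)} values for {len(self.seq)} letters"
            )
        self.letter_annotations[key] = values

    def slice(self, start, end):
        # Slicing the sequence slices each annotation the same way.
        sub = AnnotatedSeq(self.seq[start:end])
        for key, values in self.letter_annotations.items():
            sub.letter_annotations[key] = values[start:end]
        return sub


rec = AnnotatedSeq("MKVLA")
rec.annotate("secondary_structure", "HHHEC")  # H=helix, E=strand, C=coil
rec.annotate("quality", [30, 32, 35, 20, 18])
print(rec.slice(1, 4))
```

Quality values and secondary structures (RNA or protein) both fit this shape,
which is why they could share one mechanism rather than each needing a new
sequence type.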

> Because the sequence type is not defined at the entry level, adding
> new types is backwards compatible. Having these different sequences,
> one might also want to allow more than one sequence per entry.
> I do however think we should be careful with adding new features.
> We don't want to cover every possible use-case and end up with a
> format monster. Our goal was to create a simple format that fulfills the
> typical needs for FASTA. The question that remains to be solved is
> what is typical.

Indeed. As things stand, SeqXML looks capable of covering many
of the uses of FASTA.

>
> Another issue that I see is API support. Do all the Bio* APIs support such
> special alphabets?
>

We have someone looking at adding support for these modified RNA
alphabets to Biopython, but it isn't committed to the trunk yet.

>> However, the main missing feature for me is a feature table as in
>> the GenBank, GenPept, EMBL, SwissProt etc flat files, and also
>> represented in some way in their XML equivalents:
>>
>> http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.mod.dtd
>> http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.mod.dtd
>> (I haven't found the details of the feature tables/sets yet)
>>
>> http://www.uniprot.org/docs/uniprot.xsd
>> http://www.uniprot.org/docs/xml_news.htm
>> (Biopython already has a parser for the UniProt XML format, including
>> the features.)
>>
>> Clearly there is overlap here with GFF3 as well - so this is a potential
>> minefield of compatibility issues. Again, the simple answer is features
>> are out of scope.
>
> SeqXML supports simple features in the form of key-value pairs.

Yes, and that is important.
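
For anyone following along, a minimal sketch of reading those key-value
pairs with Python's standard library. It assumes the layout as I understand
the schema: entry elements with an id attribute, each containing property
elements with a name and an optional value attribute. The example document,
its version number, and the helper name entry_properties are made up for
illustration.

```python
import xml.etree.ElementTree as ET

# Made-up example document; element and attribute names are assumed
# from the SeqXML schema, so check them against the XSD before relying on this.
EXAMPLE = """<seqXML seqXMLversion="0.3">
  <entry id="example1">
    <description>Made-up entry</description>
    <DNAseq>ACGTACGT</DNAseq>
    <property name="topology" value="linear"/>
    <property name="draft"/>
  </entry>
</seqXML>"""


def entry_properties(xml_text):
    """Map each entry id to a dict of its property name/value pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter("entry"):
        props = {}
        for prop in entry.findall("property"):
            # A property without a value attribute maps to None.
            props[prop.get("name")] = prop.get("value")
        result[entry.get("id")] = props
    return result


print(entry_properties(EXAMPLE))
```

Note there is nothing positional here: a property describes the whole entry,
not a region of the sequence, which is the difference from a feature table.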

> Rich position-specific feature tables are something for full-blown record
> formats like the ones you mentioned, which we are clearly not trying to
> create. So in short I would say out of scope.

Keeping things simple is fine with me, but of course this does
mean limiting the use cases where SeqXML could be suitable.

Regards,

Peter