[Open-bio-l] SeqXML an alternative for FASTA

Peter Cock p.j.a.cock at googlemail.com
Tue Jul 5 15:23:14 UTC 2011


On Tue, Jul 5, 2011 at 3:57 PM, Thomas Schmitt <Thomas.Schmitt at sbc.su.se> wrote:
> On Jul 4, 2011, at 1:41 PM, Peter Cock wrote:
>>
>> I'm wondering about the predefined allowed character sets for DNA, RNA
>> and Protein, and if they are overly prescriptive for some special use-cases.
>> Extra symbols are sometimes included for things like frame shifts, or to
>> indicate different stop codons.
>>
>> Related to this, what about things like modified RNA (a vast alphabet),
>> or color space (used in the ABI SOLiD sequencing platform)?
>> The simple answer is these are out of scope ;)
>
> Right now SeqXML supports 3 different alphabets. These cover the
> basic use-cases and shouldn't be changed. But one can easily add
> more alphabets for special purposes in the form of different sequence
> types.

OK, if there is a common need. For now I'd keep it simple.
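
To make the "new alphabet as a new sequence type" idea concrete, here
is a rough Python sketch. The character sets, the "colorspace" type
and the function names are all placeholders of mine for illustration;
they are not taken from the SeqXML schema or from any Bio* library.

```python
# Illustrative only: these character sets are placeholders, not the
# sets defined by the SeqXML schema.
ALPHABETS = {
    "DNA": set("ACGTRYSWKMBDHVN"),
    "RNA": set("ACGURYSWKMBDHVN"),
    "protein": set("ACDEFGHIKLMNPQRSTVWYBXZ"),
}


def register_alphabet(name, letters):
    """Add a new sequence type; existing types are left unchanged."""
    if name in ALPHABETS:
        raise ValueError(f"alphabet {name!r} already defined")
    ALPHABETS[name] = set(letters)


def invalid_letters(seq, seq_type):
    """Return the sorted letters in seq that seq_type does not allow."""
    return sorted(set(seq) - ALPHABETS[seq_type])


# A hypothetical color space type, added for a special purpose.
register_alphabet("colorspace", "0123.")
```

A parser that only knows the original three types is unaffected by the
extra entry, which is the backwards compatibility argument in a
nutshell.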

> What comes to mind, apart from those mentioned above, are quality
> values and RNA secondary structures.

And also protein secondary structures (e.g. alpha/beta/coil). These
fit the per-letter-annotation idea we have in Biopython.
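
Roughly, the per-letter-annotation idea is that each named track holds
exactly one value per sequence letter, and slicing the sequence slices
every track with it. Biopython's SeqRecord does this via its
letter_annotations dictionary; the class below is only a stand-alone
stand-in written for this email (hypothetical names, made-up values),
not Biopython's code.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedSequence:
    """A sequence plus named per-letter annotation tracks (hypothetical)."""

    seq: str
    letter_annotations: dict = field(default_factory=dict)

    def annotate(self, name, values):
        # Each track must have exactly one value per sequence letter.
        if len(values) != len(self.seq):
            raise ValueError(
                f"{name}: expected {len(self.seq)} values, got {len(values)}"
            )
        self.letter_annotations[name] = values

    def slice(self, start, end):
        # Slicing the sequence slices every track the same way.
        sub = AnnotatedSequence(self.seq[start:end])
        for name, values in self.letter_annotations.items():
            sub.letter_annotations[name] = values[start:end]
        return sub


record = AnnotatedSequence("MKVLAAGIV")
record.annotate("secondary_structure", "CCHHHHEEC")  # made-up values
record.annotate("quality", [30, 32, 35, 35, 36, 36, 34, 31, 28])
```

Quality values and secondary structure (RNA or protein) then need
nothing more than a track name and a value per letter.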

> Because the sequence type is not defined at the entry level, adding
> new types is backwards compatible. Having these different sequences,
> one might also want to allow more than one sequence per entry.
> I do however think we should be careful with adding new features.
> We don't want to cover every possible use-case and end up with a
> format monster. Our goal was to create a simple format that fulfills the
> typical needs for FASTA. The question that remains to be solved is
> what is typical.

Indeed. As things stand, SeqXML looks capable of covering many
of the uses of FASTA.

>
> Another issue that I see is API support. Do all the Bio* APIs support
> such special alphabets?
>

We have someone looking at adding support for these modified RNA
alphabets to Biopython, but it isn't committed to the trunk yet.

>> However, the main missing feature for me is a feature table as in
>> the GenBank, GenPept, EMBL, SwissProt etc flat files, and also
>> represented in some way in their XML equivalents:
>>
>> http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.mod.dtd
>> http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.mod.dtd
>> (I haven't found the details of the feature tables/sets yet)
>>
>> http://www.uniprot.org/docs/uniprot.xsd
>> http://www.uniprot.org/docs/xml_news.htm
>> (Biopython already has a parser for the UniProt XML format, including
>> the features.)
>>
>> Clearly there is overlap here with GFF3 as well - so this is a potential
>> minefield of compatibility issues. Again, the simple answer is features
>> are out of scope.
>
> SeqXML supports simple features in the form of key-value pairs.

Yes, and that is important.
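
For anyone skimming the thread, simple key-value features amount to
something like the following. The element and attribute names in the
example entry are my assumption of what a SeqXML entry looks like and
should be checked against the schema; the property names and values
are invented. The parsing uses only Python's standard library.

```python
import xml.etree.ElementTree as ET

# Element and attribute names here are an assumption about SeqXML;
# check them against the schema. The property values are invented.
ENTRY = """
<entry id="example1">
  <AAseq>MKVLAAGIV</AAseq>
  <property name="evidence" value="predicted"/>
  <property name="length" value="9"/>
</entry>
"""


def entry_properties(xml_text):
    """Return the entry's key-value properties as a dict."""
    entry = ET.fromstring(xml_text)
    return {p.get("name"): p.get("value") for p in entry.findall("property")}


props = entry_properties(ENTRY)
```

Note there is no position information in there, which is exactly the
difference from a GenBank or UniProt style feature table.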

> Rich position-specific feature tables are something for full-blown record
> formats like the ones you mentioned, which we are clearly not trying to
> create. So in short I would say out of scope.

Keeping things simple is fine with me, but of course this does
mean limiting the use cases where SeqXML could be suitable.

Regards,

Peter


