[DAS2] Sequence retrieval proposal

Sun Dec 11 18:40:46 UTC 2005

Thomas:
> Do GAME or GFF have a sequence representation?  I thought they were 
> both primarily feature-table formats (right now I'm having trouble 
> finding the GAME documentation though...).

Others followed up on this.

For me, I was confused.  Even though Steve said "sequence retrieval" --
in the subject even -- I was thinking of feature formats.

I think that came to mind because I expect there to be more feature
data transferred than sequence data, so if data corruption is a concern
then the annotations are more likely to have problems.

Or I may have been thinking about some of the formats (GenBank,
Swiss-Prot) which combine the two, and have a checksum.

I still don't think checksum-identifiable data corruption is something
we need to worry about.

> The problem I have with Fasta format (other than the tendency of many 
> data-providers to over-load the header line) is that there's no 
> explicit marker for the alphabet and encoding of sequence data.

*sigh*  It seems like this never goes away.  Biopython also has a "rich"
alphabet property, designed to handle alternate alphabets, like 3-letter
codes and secondary structure alphabets.  Bioperl's approach seems more
appropriate in practice: dna, protein, rna, and perhaps 'unknown'.

In the context of DAS, this is not a problem.  DAS 2.0 uses only genomic
data, so all FASTA records will be of type 'dna'.
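To make the point concrete, here is a minimal sketch of a FASTA reader
that carries a simple four-value alphabet tag along the lines of
Bioperl's, with 'dna' as the default.  The names Alphabet, FastaRecord
and parse_fasta are made up for illustration; they are not part of DAS,
Biopython or Bioperl.

```python
from dataclasses import dataclass
from enum import Enum


class Alphabet(Enum):
    """Hypothetical four-value alphabet tag (not a DAS or Bio* API)."""
    DNA = "dna"
    RNA = "rna"
    PROTEIN = "protein"
    UNKNOWN = "unknown"


@dataclass
class FastaRecord:
    header: str
    sequence: str
    # Assumption: the server only hands out genomic sequence, so 'dna'.
    alphabet: Alphabet = Alphabet.DNA


def parse_fasta(text, alphabet=Alphabet.DNA):
    """Split FASTA text into records, tagging each with one alphabet."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(FastaRecord(header, "".join(chunks), alphabet))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(FastaRecord(header, "".join(chunks), alphabet))
    return records
```

The alphabet is supplied by the caller, not guessed from the residues,
since a string like "ACGT" is valid in more than one alphabet.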

It might be different with structure data where a single record may
have all three alphabet types.  (Though I only know of structures with
2 of the 3.)

> I guess an alternative, more classically RESTful, way of doing things
> might be with MIME types:
>
>        Content-Type: application/fasta; sequence-alphabet=DNA; sequence-encoding=IUPAC
>
> I admit I'd prefer the XML though...

As I mentioned, for purposes of DAS 2.0 this isn't needed so I
don't think we need to solve this problem.
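For reference, if someone did go the MIME-parameter route, reading the
parameters back out is the easy part.  A rough sketch using Python's
standard email.message module follows.  The media type
"application/fasta" and the two parameter names come from Thomas's
example above; as far as I know none of them are registered, and the
function name is made up for illustration.

```python
from email.message import Message


def parse_sequence_content_type(value):
    """Pull the proposed alphabet/encoding parameters from a header value.

    The parameter names are the ones proposed in this thread, not part
    of any registered media type.
    """
    msg = Message()
    msg["Content-Type"] = value
    return {
        "type": msg.get_content_type(),
        # get_param returns None when the parameter is absent.
        "alphabet": msg.get_param("sequence-alphabet"),
        "encoding": msg.get_param("sequence-encoding"),
    }


info = parse_sequence_content_type(
    "application/fasta; sequence-alphabet=DNA; sequence-encoding=IUPAC"
)
print(info["type"], info["alphabet"], info["encoding"])
```

A client would still have to decide what to do when the parameters are
missing or carry a value it does not recognize.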

If we do, I think it's a nearly intractable problem.  How does one
register all the different possible alphabets?  IUPAC dna/rna/protein
covers most of it.  Getting the other few percent is hard.  Then
making all the software preserve or interconvert the different
formats adds another layer of hard.  There are a lot of social issues
as well.

					Andrew
					dalke at dalkescientific.com