[DAS2] Sequence retrieval proposal

Andrew Dalke dalke at dalkescientific.com
Wed Dec 7 23:22:56 UTC 2005


Steve:
> This raises another issue we didn't discuss: How about allowing some
> way to verify that the sequence data received from a given reference
> server are in fact faithful copies?

> Use case 1: Validate a given reference server as providing correct
> 1. What do folks think about adding to the DAS/2 retrieval spec
>    facilities supporting sequence data validation? (i.e., Add an
>    optional checksum attribute in the REGION response.)

How many people actually write client code which verifies the checksum
of those formats which have a checksum?  I know I never have.
BioPerl's genbank.pm doesn't check the atcg base counts, nor does swiss.pm
check the CRC.  (Both generate the checks; they just don't verify them.)

For those who have implemented checksum verification, how many
times has that checksum detected an error in the data transmission?

There are already several layers of checksums in the network
connection.  One in Ethernet, another in IP, a third in TCP.
Is another one useful?

As an example, HTTP and (I think) FTP don't use checksums.  I've
transferred many very large files via both and have not had a problem.
Rather, the only check I needed was to verify that I got
all of the data, and HTTP provides that information in the
Content-Length header.
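
The completeness check described above can be sketched roughly as
follows.  This is only an illustration: the function name `is_complete`
and the plain-dict `headers` argument are assumptions made for the
example, not part of any DAS/2 client.

```python
def is_complete(headers, body):
    """Compare the received body length with the Content-Length header.

    `headers` is assumed to be a dict keyed by the exact header name.
    Returns True or False when the header is present, and None when it
    is absent (for example with chunked transfer encoding), because the
    length alone then says nothing about completeness.
    """
    declared = headers.get("Content-Length")
    if declared is None:
        return None
    return int(declared) == len(body)


# Hypothetical response data, for illustration only.
headers = {"Content-Type": "text/x-fasta", "Content-Length": "12"}
body = b">chr1\nACGTAC"
print(is_complete(headers, body))
```

A truncated download fails this test, but a body with flipped bytes of
the right length passes it, which is the gap a checksum would cover.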

Now, I know that there are problems when you scale to large
data transfers.  I even remember talking with Gregg and Lincoln
about this years ago.  A friend of mine went to

   a presentation at Stanford that Bram Cohen gave about bittorrent
   and he was commenting that the four byte check summing in TCP/IP
   isn't enough for his needs as when you're trying to transfer a
   4 gig file to 10,000 users the check summing in TCP/IP isn't enough.

We aren't in the terabyte data transfer range.

   ... doing research ...

But if it does become a concern, one solution is RFC 1864
    http://www.faqs.org/rfcs/rfc1864.html
which adds a "Content-MD5" header to the HTTP response, and
describes how to use it.  Another is RFC 3230
    http://www.scit.wlv.ac.uk/rfc/rfc32xx/RFC3230.html
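
For reference, a minimal sketch of how a client might check an
RFC 1864 Content-MD5 header.  The header value is the base64 encoding
of the 128-bit MD5 digest of the body.  The helper names `content_md5`
and `verify_content_md5` and the dict-based `headers` argument are
assumptions for this example only.

```python
import base64
import hashlib


def content_md5(body):
    """Return the RFC 1864 Content-MD5 value for a body given as bytes."""
    digest = hashlib.md5(body).digest()  # raw 16-byte digest, not hex
    return base64.b64encode(digest).decode("ascii")


def verify_content_md5(headers, body):
    """Check the body against the Content-MD5 header, if there is one.

    Returns None when the server did not send the header, otherwise
    True or False depending on whether the digests match.
    """
    expected = headers.get("Content-MD5")
    if expected is None:
        return None
    return expected.strip() == content_md5(body)


# Hypothetical response data, for illustration only.
body = b">chr1\nACGTAC\n"
headers = {"Content-MD5": content_md5(body)}
print(verify_content_md5(headers, body))         # intact body
print(verify_content_md5(headers, body[:-2]))    # truncated body
```

Because the header is optional, a client written this way still works
against servers that never send it.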

As far as I can tell, very few people, if any, actually use those
fields for anything.  That serves as a sort of confirmation that
data rarely gets corrupted at the TCP/IP level.

> 2. What do folks think about specifying a DAS2XML format for sequence
>    requests (text/x-das-sequence+xml)? In addition to permitting an
>    optional checksum attribute to address the above use case, it would
>    add some consistency and flexibility to the spec, since at present,
>    the default sequence response format is the only one that is not
>    under our control (currently it's text/x-fasta).

As a consumer of this sort of data, I don't want to write another
parser.  It isn't just the parsing part - it's the effort of mapping
to my program's data model.

There's already a huge number of existing sequence file formats.
What would another provide?  Are some of them already extensible?

Several of those formats are designed and developed by people involved
with DAS.  If it's important, extend GAME or GFF.

As a spec writer, I don't really want to write that part of the spec.

					Andrew
					dalke at dalkescientific.com



