[Biojava-l] GCG format...

sanjay kumar@neb.com
Tue, 15 Oct 2002 13:24:00 -0400


Mark wrote :

> supposed to be used. Ie GCG won't tell you how to do it so doing it
> anyway might be breaking some annoying little copyright law.
>
> - Mark


Back in 1999, Lynn Miller, then Bioinformatics Support Coordinator for GCG,
kindly provided information on GCG's formats (so they *will* tell you, or at
least they used to).  No copyright law issues were raised at that point and
my stated goal was to read/write GCG format sequences in software for my
company.  For the record, this is what GCG provided back then regarding
checksums (note that the code is C, not Java):

---- from GCG -----
Checksum calculations

Most sequence files require the use of a checksum to verify that a sequence
has not been modified by hand. The checksum is calculated only for the
sequence itself, not the entire file. This checksum calculation is used by
single sequence files, MSF files, and (optionally) RSF files. The
calculation as stated in C code is:

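#include <ctype.h>   /* toupper */
#include <string.h>  /* strlen */

int chksum(char *sequence)
{
    int len, position, chk = 0;
    len = strlen(sequence);
    /* Weight each uppercased character by a position factor that cycles 1..57 */
    for (position = 0; position < len; position++)
        chk = (chk + (position % 57 + 1) * toupper((unsigned char)sequence[position])) % 10000;
    return chk;
}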

Note that checksums are case insensitive. Also note that the checksum for an
aligned RSF sequence may differ from the same sequence in an MSF file or
single sequence file. This is due to the fact that leading gaps, if any, are
removed in RSF format.
--------------------
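
To make the two notes above concrete (case insensitivity, and the effect of
leading gaps), here is a minimal sketch in C. It assumes '.' is the gap
character; the names gcg_checksum and skip_leading_gaps are mine and are
hypothetical, not something GCG supplied.

```c
#include <ctype.h>
#include <string.h>

/* Same calculation as the GCG routine quoted above, under a different name. */
int gcg_checksum(const char *sequence)
{
    size_t len = strlen(sequence);
    int chk = 0;
    for (size_t i = 0; i < len; i++)
        chk = (chk + (int)(i % 57 + 1) * toupper((unsigned char)sequence[i])) % 10000;
    return chk;
}

/* Hypothetical helper: step past leading gap characters ('.' is assumed). */
const char *skip_leading_gaps(const char *sequence)
{
    while (*sequence == '.')
        sequence++;
    return sequence;
}
```

With these definitions, gcg_checksum("ACGT") and gcg_checksum("acgt") both
give 748, while gcg_checksum("..ACGT") gives 1460 because every residue is
shifted to a different position weight. Stripping the leading gaps first with
skip_leading_gaps brings the value back to 748, which is the kind of
RSF-versus-MSF difference the note describes.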

Sanjay