[BioSQL-l] CRC calculation
Hilmar Lapp
hlapp at gnf.org
Tue Aug 23 15:41:33 EDT 2005
On Aug 23, 2005, at 2:24 AM, mark.schreiber at novartis.com wrote:
> Hello -
>
> According the the BioSQL docs the CRC of the Reference table is a
> calculated checksum over the author, location and title fields. It also
> states that it helps ensure uniqueness. I have some questions:
>
> 1) Is the CRC constrained to be unique?
Yes. It is a substitute for a unique constraint over
(authors,title,location) because their combined field length is too
long to create an index on (at least in Oracle), and even if it were
possible it would probably not be a very efficient index.
> If it is that could be a problem as CRC is not guarenteed to be
> unique for any String.
Correct, yes. However, I haven't had a single incident where this was a
problem. Also, I'm using a CRC64, so there's several orders of
magnitudes more possible values than journal articles accumulated over
the past centuries :-)
> 2) Is everyone supposed to use the same algorithm or is it dependent on
> the user?
It is somewhat dependent on the user, in the sense that the Biosql
schema does prescribe which algorithm to use. However, if you use
bioperl/bioperl-db for loading then the CRCs will be calculated for you
(and in fact also for sequences if you use the Oracle schema), so if
you make updates or inserts outside of bioperl as well then you will
probably want to be sure that you use consistent algorithms.
> 3) If 2 then are you using CRC16 or CRC32
Neither. It's a CRC64.
> 4) If 2 are you adding the independant CRC's of the author, location,
> and
> title or are you combining them in a String and calculating the CRC for
> the string. If that is the case then what is the form of the string?
I combine them in a string first, substituting a default value for
undefined properties. The CRC algorithm is the same as the one used for
computing the SwissProt CRC64 (so if Biojava writes Swissprot then you
probably have this algorithm in the library already). Here's the
complete algorithm in perl with preceding documentation. I hope you can
read perl; if not let me know ;-)
=head2 _crc64
Title : _crc64
Usage :
Function: Computes and returns the CRC64 checksum for a given
reference object.
The method uses the reference's authors, title, and
location properties.
Example :
Returns : the CRC64 as a string
Args : the Bio::Annotation::Reference object for which to compute
the CRC
=cut
sub _crc64{
my $self = shift;
my $obj = shift;
my $str =
(defined($obj->authors) ? $obj->authors : "<undef>") .
(defined($obj->title) ? $obj->title : "<undef>") .
(defined($obj->location) ? $obj->location : "<undef>");
return 'CRC-'.$self->crc64($str);
}
=head2 crc64
Title : crc64
Usage :
Function: Computes and returns the CRC64 checksum for a given string.
This is basically ripped out of the bioperl swissprot
parser. Credits go to whoever contributed it there.
Example :
Returns : the CRC64 checksum as a string
Args : the string as a scalar for which to obtain the CRC64
=cut
sub crc64{
my ($self, $str) = @_;
my $POLY64REVh = 0xd8000000;
my @CRCTableh;
my @CRCTablel;
if (exists($self->{'_CRCtableh'})) {
@CRCTableh = @{$self->{'_CRCtableh'}};
@CRCTablel = @{$self->{'_CRCtablel'}};
} else {
@CRCTableh = 256;
@CRCTablel = 256;
for (my $i=0; $i<256; $i++) {
my $partl = $i;
my $parth = 0;
for (my $j=0; $j<8; $j++) {
my $rflag = $partl & 1;
$partl >>= 1;
$partl |= (1 << 31) if $parth & 1;
$parth >>= 1;
$parth ^= $POLY64REVh if $rflag;
}
$CRCTableh[$i] = $parth;
$CRCTablel[$i] = $partl;
}
$self->{'_CRCtableh'} = \@CRCTableh;
$self->{'_CRCtablel'} = \@CRCTablel;
}
my $crcl = 0;
my $crch = 0;
foreach (split '', $str) {
my $shr = ($crch & 0xFF) << 24;
my $temp1h = $crch >> 8;
my $temp1l = ($crcl >> 8) | $shr;
my $tableindex = ($crcl ^ (unpack "C", $_)) & 0xFF;
$crch = $temp1h ^ $CRCTableh[$tableindex];
$crcl = $temp1l ^ $CRCTablel[$tableindex];
}
my $crc64 = sprintf("%08X%08X", $crch, $crcl);
return $crc64;
}
>
> - Mark
>
> Mark Schreiber
> Principal Scientist (Bioinformatics)
>
> Novartis Institute for Tropical Diseases (NITD)
> 10 Biopolis Road
> #05-01 Chromos
> Singapore 138670
> www.nitd.novartis.com
>
> phone +65 6722 2973
> fax +65 6722 2910
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------
More information about the BioSQL-l
mailing list