[Biopython] matching sequences from fasta files

Peter biopython at maubp.freeserve.co.uk
Wed Mar 10 13:00:15 UTC 2010


On Wed, Mar 10, 2010 at 11:15 AM, Ivan Rossi <ivan at biodec.com> wrote:
> On Wed, 10 Mar 2010, Peter wrote:
>
>> For the special case of looking for perfect matches, you would be fine
>> with just Python - depending on your data files, you may be able to
>> match on the record identifiers
>
> Don't trust that. We have seen many many times the sequence change
> over time (in different releases of the databases) while keeping the same id.

Yes, be cautious about blindly matching on just the identifier.
That's why I said "may" ;)

> it is much more robust to compare SHA1 (or MD5) hashes of the
> sequence, or do string comparisons.

MD5 is known to have collisions, but Sebastián Bassi added support
in Biopython for the GCG and SEGUID checksums, e.g. see:

from Bio.SeqUtils.CheckSum import seguid
help(seguid)

SEGUID uses SHA1 internally, and it takes care of case by upper-casing
the sequence before hashing.
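
As a rough illustration of matching on checksums rather than identifiers,
here is a minimal sketch using only the standard library. The names
seguid_like, read_fasta and match_by_checksum are made up for this example
and are not part of Biopython. The checksum follows the SEGUID recipe
(SHA1 of the upper-cased sequence, base64 encoded, "=" padding removed),
so it should agree with Bio.SeqUtils.CheckSum.seguid, but treat that as an
assumption and check it against your Biopython version.

```python
import base64
import hashlib


def seguid_like(seq):
    """SHA1 of the upper-cased sequence, base64 encoded, '=' padding removed.

    Hypothetical helper written for this example; it is not Biopython's code.
    """
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # The identifier is the first word of the header line.
            name, chunks = (parts[0] if parts else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def match_by_checksum(records_a, records_b):
    """Return (id_a, id_b) pairs whose sequences share a checksum."""
    index = {}
    for name, seq in records_a:
        index.setdefault(seguid_like(seq), []).append(name)
    return [
        (id_a, id_b)
        for id_b, seq in records_b
        for id_a in index.get(seguid_like(seq), [])
    ]


# Made-up data: "seq1" keeps its identifier but its sequence has changed,
# while "seq2" reappears under a new identifier and in lower case.
old_release = [">seq1 first", "ACGT", "acgt", ">seq2", "GGGG"]
new_release = [">seq1 first", "ACGTACGA", ">other", "gggg"]

matches = match_by_checksum(read_fasta(old_release), read_fasta(new_release))
print(matches)
```

With real files you would pass open file handles to read_fasta instead of
the lists above, or use Bio.SeqIO.parse and seguid from Biopython itself.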

Peter



