[Bioperl-l] removing duplicate fasta records
Jonathan Epstein
Jonathan_Epstein@nih.gov
Tue, 17 Dec 2002 15:53:57 -0500
Option #1:
create a hash keyed on the sequences as you read them, and detect duplicates by checking whether that hash key already exists
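In Perl this is just a `%seen` hash. A minimal sketch using only core Perl (in real BioPerl code you would read the records with Bio::SeqIO rather than build them by hand; the record data here are made-up placeholders):

```perl
use strict;
use warnings;

# Option #1 sketch: keep the first record for each distinct sequence,
# drop later duplicates. Each record is [header, sequence].
sub dedup_by_sequence {
    my (@records) = @_;
    my %seen;
    # grep keeps a record only if its sequence has not been seen before;
    # the post-increment marks it as seen for subsequent records.
    return grep { !$seen{ $_->[1] }++ } @records;
}

my @records = (
    [ 'seq1', 'ACGTACGT' ],
    [ 'seq2', 'GGGGCCCC' ],
    [ 'seq3', 'ACGTACGT' ],   # same sequence as seq1 -> dropped
);
my @unique = dedup_by_sequence(@records);
print scalar(@unique), "\n";   # prints 2
```

Because the hash keys are the full sequence strings, memory use grows with the total amount of distinct sequence data, which is what motivates Option #2.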
Option #2 (slightly harder):
create such a hash and check for duplicates, but instead of keying the hash on the sequences themselves, key it on a checksum of each sequence, such as an MD5 digest:
http://search.cpan.org/author/GAAS/Digest-MD5-2.20/MD5.pm
or the GCG checksum:
http://search.cpan.org/author/BIRNEY/bioperl-1.0.2/Bio/SeqIO/gcg.pm
This requires more CPU time, but much less memory.
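The MD5 variant can be sketched with the core Digest::MD5 module: only a fixed-size 16-byte digest is stored per distinct sequence, so memory use no longer depends on sequence length (the record data are again placeholders, and the vanishingly small chance of an MD5 collision is ignored here):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Option #2 sketch: key the %seen hash on a 16-byte binary MD5 digest
# of each sequence instead of the full sequence string.
sub dedup_by_md5 {
    my (@records) = @_;
    my %seen;
    # md5() returns the binary digest, which works fine as a hash key.
    return grep { !$seen{ md5( $_->[1] ) }++ } @records;
}

my @records = (
    [ 'seq1', 'ACGTACGT' ],
    [ 'seq2', 'GGGGCCCC' ],
    [ 'seq3', 'ACGTACGT' ],   # same sequence as seq1 -> dropped
);
my @unique = dedup_by_md5(@records);
print scalar(@unique), "\n";   # prints 2
```

Computing a digest per record is the extra CPU cost mentioned above; the saving is that a long sequence and a short one both cost 16 bytes of hash key.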
Option #1 is quick and dirty, and is appropriate if your input file contains at most a few megabytes of data.
Jonathan
At 12:41 PM 12/17/2002 -0700, Amit Indap <indapa@cs.arizona.edu> wrote:
>I have a file with a list of fasta sequences. Is there a way to
>remove records with the identical sequence? I am a newbie to BioPerl,
>and my search through the documentation hasn't found anything.
>
>Thank you.
>
>Amit Indap