[Bioperl-l] removing duplicate fasta records

Lincoln Stein lstein@cshl.org
Tue, 17 Dec 2002 15:57:38 -0500


Here's one way.  Note that it removes a record only when its sequence is 
identical to one already seen (the first copy is kept), and that it doesn't 
examine the ID or description line.  The comparison is also case-sensitive.  
You may want to modify this if you have a different definition of identical 
sequences.

Lincoln

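#!/usr/bin/perl

use strict;
use warnings;

use Bio::SeqIO;
use Digest::MD5 'md5_hex';

# MD5 digests of the sequences already written
my %digests;

# Read FASTA from the files named on the command line (or STDIN);
# write FASTA to STDOUT.
my $in  = Bio::SeqIO->new(-fh => \*ARGV,   -format => 'fasta');
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');

while (my $seq = $in->next_seq) {
	# Skip any record whose sequence has been seen before
	next if $digests{md5_hex($seq->seq)}++;
	$out->write_seq($seq);
}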
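
If your definition of "identical" is different, one option is to move the
key computation into a small helper.  The following is only a sketch: 
sequence_key is a hypothetical name, not part of BioPerl.  It assumes you want 
upper- and lower-case residues treated as the same, and it optionally folds 
the description into the key so that records with the same sequence but 
different descriptions are both kept.

```perl
use strict;
use warnings;
use Digest::MD5 'md5_hex';

# Hypothetical helper (not part of BioPerl): build the key used to decide
# whether two records count as duplicates.
#   $residues - the sequence string, e.g. from $seq->seq
#   $desc     - optional description text, e.g. from $seq->desc;
#               pass undef (or omit it) to ignore descriptions
sub sequence_key {
    my ($residues, $desc) = @_;

    # Assumption: 'acgt' and 'ACGT' should count as the same sequence
    my $key = uc $residues;

    # Separate the two parts with a newline, which should not occur
    # inside either part of a FASTA record
    $key .= "\n$desc" if defined $desc;

    return md5_hex($key);
}
```

In the loop, the test would then become

	next if $digests{sequence_key($seq->seq, $seq->desc)}++;

or, to keep ignoring descriptions, pass only $seq->seq.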



On Tuesday 17 December 2002 02:41 pm, Amit Indap wrote:
> I have a file with a list of fasta sequences. Is there a way to
> remove records with the identical sequence? I am a newbie to BioPerl,
> and my search through the documentation hasn't found anything.
>
> Thank you.
>
> Amit Indap

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org                                   Cold Spring Harbor, NY
========================================================================