[Bioperl-l] removing duplicate fasta records
Lincoln Stein
lstein@cshl.org
Tue, 17 Dec 2002 15:57:38 -0500
Here's one way. Note that it treats two records as duplicates only when their
sequence strings are identical (the comparison is case-sensitive), and it keeps
the first copy it sees. It doesn't examine the description line. You may want
to modify this if you have a different definition of identical sequences.
Lincoln
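#!/usr/bin/perl
# Reads FASTA from the files named on the command line (or from STDIN) and
# writes the first record seen for each distinct sequence to STDOUT.
use strict;
use warnings;
use Bio::SeqIO;
use Digest::MD5 'md5_hex';

my %digests;    # MD5 digest of a sequence => number of times seen so far
my $in  = Bio::SeqIO->new(-fh => \*ARGV,   -format => 'fasta');
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');
while (my $seq = $in->next_seq) {
    # Skip any record whose sequence has already been written
    next if $digests{md5_hex($seq->seq)}++;
    $out->write_seq($seq);
}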
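
If your definition of "identical" differs, one option is to change what goes
into the digest. The following is only a sketch: dedup_key and its ignore_case
and use_desc options are hypothetical names made up for illustration, not part
of BioPerl. It uses nothing beyond the core Digest::MD5 module and runs on a
small in-memory list so it can be tried without any input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 'md5_hex';

# Hypothetical helper (not a BioPerl function): build the key that decides
# whether two records count as duplicates.
#   ignore_case => 1  treats 'acgt' and 'ACGT' as the same sequence
#   use_desc    => 1  also requires the description lines to match
sub dedup_key {
    my ($sequence, $description, %opt) = @_;
    my $key = $opt{ignore_case} ? uc($sequence) : $sequence;
    # "\0" does not occur in sequence data, so it safely separates the fields
    $key .= "\0" . $description if $opt{use_desc} && defined $description;
    return md5_hex($key);
}

# Made-up demonstration records: [id, description, sequence]
my @records = (
    ['a', 'first',  'ACGT'],
    ['b', 'second', 'acgt'],
    ['c', 'first',  'ACGT'],
    ['d', 'third',  'GGCC'],
);

# Keep the first record for each distinct key, ignoring case only
my %seen;
my @kept = grep {
    !$seen{ dedup_key($_->[2], $_->[1], ignore_case => 1) }++
} @records;

print join(',', map { $_->[0] } @kept), "\n";    # prints "a,d"
```

To use it in the script, the test in the loop would become something like
next if $digests{dedup_key($seq->seq, $seq->desc, ignore_case => 1)}++;
where $seq->desc returns the description part of the FASTA header.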
On Tuesday 17 December 2002 02:41 pm, Amit Indap wrote:
> I have a file with a list of fasta sequences. Is there a way to
> remove records with the identical sequence? I am a newbie to BioPerl,
> and my search through the documentation hasn't found anything.
>
> Thank you.
>
> Amit Indap
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
========================================================================