[Bioperl-l] removing duplicate fasta records

Simon K. Chan bioinformatics_rocks@yahoo.com
Thu, 19 Dec 2002 17:14:37 -0800 (PST)


Hiya Amit,

Just to add to what Jonathan Epstein wrote, try something
like this:

#!/usr/bin/perl -w

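use strict;
use Bio::SeqIO;

# input FASTA file, taken from the command line
my $fileName = shift @ARGV or die "Usage: $0 file.fasta\n";

my $in = Bio::SeqIO->newfh(-file=>$fileName,
                           -format=>"FASTA");

my %matching_hash = ();   # sequences seen so far
my %final_hash = ();      # display_id => sequence, first occurrence only

while (my $obj = <$in>){

       # keep a record only the first time its sequence shows up
       unless($matching_hash{$obj->seq}){
            $final_hash{$obj->display_id} = $obj->seq;
            $matching_hash{$obj->seq} = 1;
       }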

} 
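
The loop above only fills %final_hash; nothing is written out yet. Here is a rough sketch of one way to print the survivors as FASTA, using core Perl only. The sample ids and sequences, the fasta_record helper and the 60-column wrap are all made up for illustration. Keep in mind that the hash is keyed on display_id, so two different sequences sharing an id would overwrite each other, and that a hash does not remember input order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the %final_hash built by the loop above:
# display_id => sequence, one entry per distinct sequence.
my %final_hash = (
    seq1 => 'ACGTACGT',
    seq2 => 'GGGTTTAA',
);

# Format one record as FASTA text, wrapping the sequence at $width columns
# (60 by default; the width is an arbitrary choice for this sketch).
sub fasta_record {
    my ($id, $seq, $width) = @_;
    $width ||= 60;
    my $out = ">$id\n";
    $out .= "$1\n" while $seq =~ /(.{1,$width})/g;
    return $out;
}

# A hash has no inherent order, so sort the ids to get stable output.
print fasta_record($_, $final_hash{$_}) for sort keys %final_hash;
```

If the original record order matters, pushing the ids onto an array as they are accepted and looping over that array instead of the sorted keys should do it.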

TMTOWTDI!
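
For instance, Jonathan's Option #2 (quoted below) might look roughly like this with Digest::MD5. This is only a sketch: the unique_by_md5 helper and the sample records are invented for the example, and records are plain [id, sequence] pairs rather than Bio::Seq objects so that it runs without BioPerl. Two different sequences could in principle share an MD5 digest, though that is extremely unlikely in practice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Return the records whose sequence has not been seen before, in input order.
# Each record is an array ref [id, sequence]. Only the 32-character hex
# digest of each sequence is stored in %seen, not the sequence itself.
sub unique_by_md5 {
    my @records = @_;
    my %seen;
    return grep { !$seen{ md5_hex($_->[1]) }++ } @records;
}

# Made-up example records: seq3 repeats the sequence of seq1.
my @records = (
    [ seq1 => 'ACGTACGT' ],
    [ seq2 => 'GGGTTTAA' ],
    [ seq3 => 'ACGTACGT' ],
);

print ">$_->[0]\n$_->[1]\n" for unique_by_md5(@records);
```

The memory saving only applies to the lookup hash. To benefit on a big file you would test each record as it is read and write it out straight away, rather than collecting everything in an array as this toy example does.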

:-)

HTH,
Simon



--- Jonathan Epstein <Jonathan_Epstein@nih.gov> wrote:
> To: "Amit Indap <indapa@cs.arizona.edu>"
> <indapa@amadeus.biosci.arizona.edu>,
>        bioperl-l@bioperl.org
> From: Jonathan Epstein <Jonathan_Epstein@nih.gov>
> Subject: Re: [Bioperl-l] removing duplicate fasta
> records
> Date: Tue, 17 Dec 2002 15:53:57 -0500
> 
> Option #1:
>   create a hash with all the sequences as you read
> them, and check for duplicates by seeing whether
> that hash element already exists
> 
> Option #2 (slightly harder):
>   create such a hash and check for duplicates, but
> instead of hashing the sequences, hash a checksum of
> the sequences such as MD5:
>    http://search.cpan.org/author/GAAS/Digest-MD5-2.20/MD5.pm
> or the GCG checksum:
>    http://search.cpan.org/author/BIRNEY/bioperl-1.0.2/Bio/SeqIO/gcg.pm
> 
> This requires more CPU time, but much less memory.
> 
> 
> Option #1 is quick-and-dirty, and is appropriate if
> your input file contains only a few megabytes of
> data (or less).
> 
> Jonathan
> 
> 
> At 12:41 PM 12/17/2002 -0700, Amit Indap <indapa@cs.arizona.edu> wrote:
> >I have a file with a list of fasta sequences. Is there a way to
> >remove records with the identical sequence? I am a newbie to BioPerl,
> >and my search through the documentation hasn't found anything.
> >
> >Thank you.
> >
> >Amit Indap
> 
> 

