[Bioperl-l] removing duplicate fasta records
   
    Simon K. Chan
     
    bioinformatics_rocks@yahoo.com
       
    Thu, 19 Dec 2002 17:14:37 -0800 (PST)
    
    
  
Hiya Amit,
Just to add to what Jonathan Epstein wrote, try something
like this:
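#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
# FASTA file to read, taken from the command line
my $fileName = shift or die "Usage: $0 <fasta file>\n";
my $in = Bio::SeqIO->newfh(-file=>$fileName,
-format=>"FASTA");
my %matching_hash = ();   # sequences already seen
my %final_hash = ();      # display_id => sequence, first occurrence only
while (my $obj = <$in>){
       # keep a record only if its sequence has not been seen before
       unless($matching_hash{$obj->seq}){
            $final_hash{$obj->display_id} = $obj->seq;
            $matching_hash{$obj->seq} = 1;
       }
}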
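If memory becomes a problem, Jonathan's Option #2 (quoted below) could look
roughly like the following. This is only a sketch: it uses the md5_hex
function from Digest::MD5 and, to stay runnable without BioPerl, works on a
small made-up list of [display_id, sequence] pairs. The sample records and
the helper name dedupe_by_md5 are my own inventions for illustration, not
part of BioPerl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical sample records: [display_id, sequence] pairs.
# In real use these would come from Bio::SeqIO instead.
my @records = (
    [ 'seq1', 'ACGTACGT' ],
    [ 'seq2', 'TTGGCCAA' ],
    [ 'seq3', 'ACGTACGT' ],   # same sequence as seq1
);

# dedupe_by_md5 is a hypothetical helper name, not a BioPerl function.
# It keeps the first record seen for each distinct sequence digest.
sub dedupe_by_md5 {
    my @in = @_;
    my %seen;      # MD5 hex digest => number of times seen
    my @unique;    # records kept, in input order
    for my $rec (@in) {
        my ($id, $seq) = @$rec;
        my $digest = md5_hex($seq);   # 32-character key instead of the full sequence
        next if $seen{$digest}++;     # skip sequences whose digest was already seen
        push @unique, $rec;
    }
    return @unique;
}

# Print the surviving records in FASTA-like form.
my @unique = dedupe_by_md5(@records);
print ">$_->[0]\n$_->[1]\n" for @unique;
```

The hash then holds one short digest per distinct sequence rather than the
sequence itself. Two different sequences could in principle share an MD5
digest, but that is extremely unlikely to happen by accident.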
TMTOWTDI!
:-)
HTH,
Simon
--- Jonathan Epstein <Jonathan_Epstein@nih.gov> wrote:
> To: "Amit Indap <indapa@cs.arizona.edu>"
> <indapa@amadeus.biosci.arizona.edu>,
>        bioperl-l@bioperl.org
> From: Jonathan Epstein <Jonathan_Epstein@nih.gov>
> Subject: Re: [Bioperl-l] removing duplicate fasta
> records
> Date: Tue, 17 Dec 2002 15:53:57 -0500
> 
> Option #1:
>   create a hash with all the sequences as you read
> them, and check for duplicates by seeing whether
> that hash element already exists
> 
> Option #2 (slightly harder):
>   create such a hash and check for duplicates, but
> instead of hashing the sequences, hash a checksum of
> the sequences such as MD5:
>   http://search.cpan.org/author/GAAS/Digest-MD5-2.20/MD5.pm
> or the GCG checksum:
>   http://search.cpan.org/author/BIRNEY/bioperl-1.0.2/Bio/SeqIO/gcg.pm
> 
> This requires more CPU time, but much less memory.
> 
> 
> Option #1 is quick-and-dirty, and is appropriate if
> your input file contains only a few megabytes of
> data (or less).
> 
> Jonathan
> 
> 
> At 12:41 PM 12/17/2002 -0700, Amit Indap
> <indapa@cs.arizona.edu> wrote:
> >I have a file with a list of fasta sequences. Is there a way to
> >remove records with the identical sequence? I am a newbie to BioPerl,
> >and my search through the documentation hasn't found anything.
> >
> >Thank you.
> >
> >Amit Indap
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l