[Bioperl-l] removing duplicate fasta records

Paul Gordon gordonp@cbr.nrc.ca
Tue, 17 Dec 2002 18:39:37 -0700


Jonathan Epstein wrote:

>Option #1:
>  create a hash with all the sequences as you read them, and check for duplicates by seeing whether that hash element already exists
>
Which in the spirit of ridiculous one-liners would be:

perl -ne 'BEGIN{$/=">";$"=";"}/(.*?)\n(.+?)\s*>?$/s && push @{$h{$2}},$1;END{for(keys%h){print ">@{$h{$_}}\n$_\n"}}'

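This removes redundant entries AND concatenates their description 
lines, joined with ";" :-) Note that two records only count as 
duplicates if their sequence text, line breaks included, is identical.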
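
For anyone who finds the one-liner hard to read, here is the same idea spelled out as a rough Python sketch (Python rather than Perl, purely for readability). The function name merge_duplicate_fasta and the sep parameter are made up for illustration and are not part of BioPerl or of the command above. It assumes ">" never appears inside a description line, and it does not reproduce every edge case of the regex.

```python
def merge_duplicate_fasta(text, sep=";"):
    """Collapse FASTA records with identical sequence text, joining headers.

    Hypothetical helper, for illustration only. The sequence text (line
    breaks included) is the dictionary key, so two records merge only
    when that text matches exactly.
    """
    headers_by_seq = {}
    # Splitting on ">" mimics setting Perl's input record separator to ">".
    for chunk in text.split(">"):
        chunk = chunk.rstrip()
        if "\n" not in chunk:
            # Skips the empty chunk before the first ">" and any record
            # that has a header but no sequence.
            continue
        header, seq = chunk.split("\n", 1)
        headers_by_seq.setdefault(seq, []).append(header)
    # One output record per distinct sequence, headers joined by sep.
    return "".join(
        ">{}\n{}\n".format(sep.join(headers), seq)
        for seq, headers in headers_by_seq.items()
    )


if __name__ == "__main__":
    sample = ">a first\nACGT\n>b second\nGGCC\n>c third\nACGT\n"
    print(merge_duplicate_fasta(sample), end="")
```

Unlike the Perl version, whose hash key order is arbitrary, a Python dict keeps the order in which each distinct sequence was first seen, so the output order is predictable.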