[Bioperl-l] removing duplicate fasta records
Paul Gordon
gordonp@cbr.nrc.ca
Tue, 17 Dec 2002 18:39:37 -0700
Jonathan Epstein wrote:
>Option #1:
> create a hash with all the sequences as you read them, and check for duplicates by seeing whether that hash element already exists
>
Which in the spirit of ridiculous one-liners would be:
perl -ne 'BEGIN{$/=">";$"=";"}/(.*?)\n(.+?)\s*>?$/s && push @{$h{$2}},$1;END{for(keys%h){print ">@{$h{$_}}\n$_\n"}}'
Which will remove redundant entries AND concatenate their description
lines, separated by ";" :-)
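
For anyone who would rather read it than golf it, here is a rough long-hand sketch of the same hash idea. It is not an exact equivalent of the one-liner: the name dedupe_fasta and the sample records are made up for illustration, it strips whitespace from the sequence so that differently wrapped copies still count as duplicates (the one-liner keys on the sequence text as-is, line breaks included), and it prints records in first-seen order rather than hash order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# dedupe_fasta is a hypothetical helper name, not part of Bioperl.
# Takes FASTA text, returns FASTA text with one record per distinct
# sequence; descriptions of duplicate records are joined with ";".
sub dedupe_fasta {
    my ($text) = @_;
    my %descs_for;    # sequence => [ descriptions seen for it ]
    my @order;        # distinct sequences in first-seen order
    my ($desc, $seq);

    # Store the record collected so far, if there is one.
    my $flush = sub {
        return unless defined $desc;
        push @order, $seq unless exists $descs_for{$seq};
        push @{ $descs_for{$seq} }, $desc;
    };

    for my $line (split /\r?\n/, $text) {
        if (my ($header) = $line =~ /^>(.*)/) {
            $flush->();                  # previous record is complete
            ($desc, $seq) = ($header, '');
        }
        elsif (defined $desc) {
            $line =~ s/\s+//g;           # assumption: ignore wrapping/whitespace
            $seq .= $line;
        }
    }
    $flush->();                          # last record

    return join '',
        map { '>' . join(';', @{ $descs_for{$_} }) . "\n$_\n" } @order;
}

# Made-up sample: seq1 and seq3 carry the same sequence, wrapped differently.
my $sample = ">seq1 first copy\nACGT\nTTGA\n"
           . ">seq2 unique\nGGCC\n"
           . ">seq3 second copy\nACGTTTGA\n";
print dedupe_fasta($sample);
```

Note that each merged sequence comes out on a single unwrapped line. Like the one-liner, this keeps every distinct sequence in memory, so it suits files that fit comfortably in RAM.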