[Bioperl-l] Removing duplicate FASTA record one-liners
William R. Pearson
wrp@alpha0.bioch.virginia.edu
Wed, 18 Dec 2002 15:00:48 -0500
We have had great success with the "hash the sequence" approach, and
have even used it to find duplicates between nr and trembl, but the
one-liner:
>> Option #1:
>> create a hash with all the sequences as you read them, and check for
>> duplicates by seeing whether that hash element already exists
>>
> Which in the spirit of ridiculous one-liners would be:
>
> perl -ne 'BEGIN{$/=">";$"=";"}/(.*?)\n(.+?)\s*>?$/s && push
> @{$h{$2}},$1;END{for(keys%h){print ">@{$h{$_}}\n$_\n"}}'
>
> Which will remove redundant entries AND concatenate their description
> lines :-)
does not work properly, because the regex that parses each
>header\nsequence record:
/(.*?)\n(.+?)\s*>?$/s
leaves the '\n's in the captured sequence. Thus, not only would the
sequences have to be identical, their line breaks would have to be
identical as well.
Lincoln's solution is not only easier to read, it actually works.
Bill