[Bioperl-l] Removing duplicate FASTA record one-liners

William R.Pearson wrp@alpha0.bioch.virginia.edu
Wed, 18 Dec 2002 15:00:48 -0500


We have had great success with the "hash the sequence" approach, and
have even used it to find duplicates between nr and trembl, but the
one-liner:

>> Option #1:
>>  create a hash with all the sequences as you read them, and check for 
>> duplicates by seeing whether that hash element already exists
>>
> Which in the spirit of ridiculous one-liners would be:
>
> perl -ne 'BEGIN{$/=">";$"=";"}/(.*?)\n(.+?)\s*>?$/s && push
> @{$h{$2}},$1;END{for(keys%h){print ">@{$h{$_}}\n$_\n"}}'
>
> Which will remove redundant entries AND concatenate their description
> lines :-)

does not work properly, because the parse of the >header\nsequence:

	/(.*?)\n(.+?)\s*>?$/s

leaves in the '\n's.  Thus, not only would the sequences have to be
identical, their line breaks would have to be identical as well.
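
One way around this is to strip all whitespace from each sequence before
using it as the hash key. What follows is only a minimal sketch of that
idea, not Lincoln's code and not a fixed version of the one-liner: the
name dedupe_fasta and the sample records are made up for illustration,
and it assumes the whole file fits in memory and that '>' appears only at
the start of header lines.

```perl
use strict;
use warnings;

# dedupe_fasta is a hypothetical helper, not code from this thread.
# It merges records whose sequences match once whitespace is removed,
# joining their description lines with ';'.
sub dedupe_fasta {
    my ($text) = @_;
    my (%headers_for, @order);
    for my $record (split /^>/m, $text) {
        next unless $record =~ /\S/;      # skip the empty chunk before the first '>'
        my ($header, $seq) = split /\n/, $record, 2;
        $seq = '' unless defined $seq;    # header with no sequence lines
        $header =~ s/\s+$//;              # trim trailing whitespace (e.g. '\r')
        $seq =~ s/\s+//g;                 # drop the '\n's so line wrapping is ignored
        push @order, $seq unless exists $headers_for{$seq};
        push @{ $headers_for{$seq} }, $header;
    }
    # Emit records in order of first appearance, one sequence per line.
    return join '',
        map { '>' . join(';', @{ $headers_for{$_} }) . "\n$_\n" } @order;
}

# seq1 and seq2 hold the same residues but are wrapped differently.
my $demo = ">seq1\nACGTACGT\nACGT\n>seq2\nACGTAC\nGTACGT\n>seq3\nTTTT\n";
print dedupe_fasta($demo);
```

With the sample above, seq1 and seq2 collapse into a single record with
the header "seq1;seq2". Note that the sketch writes each sequence back as
one unwrapped line and compares residues case-sensitively.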

Lincoln's solution is not only easier to read, it actually works.

Bill