[Bioperl-l] Removing duplicate FASTA record one-liners
Paul Gordon
gordonp@cbr.nrc.ca
Thu, 19 Dec 2002 11:24:58 -0700
>
>>> Option #1:
>>> create a hash with all the sequences as you read them, and check
>>> for duplicates by seeing whether that hash element already exists
>>>
>> Which in the spirit of ridiculous one-liners would be:
>>
>> perl -ne 'BEGIN{$/=">";$"=";"}/(.*?)\n(.+?)\s*>?$/s && push @{$h{$2}},$1;END{for(keys%h){print ">@{$h{$_}}\n$_\n"}}'
>>
>> Which will remove redundant entries AND concatenate their description
>> lines :-)
>
>
> does not work properly, because the parse of the ">header\nsequence"
> record:
>
> /(.*?)\n(.+?)\s*>?$/s
>
> leaves the '\n's in the captured sequence. Thus, not only would the
> sequences have to be identical, their line breaks would have to be
> identical as well.
>
> Lincoln's solution is not only easier to read, it actually works.
Hee hee. I didn't think this bit of fun (i.e. not intended for
production use, hence the word "ridiculous" in my original mail) would
spark such a long discussion! For brevity I didn't include the 'tr'
statement required to deal with differently formatted files. The real
point of my mail, though, is that to make this complete, both Lincoln's
code and my own need an lc() or uc() (your choice) so that sequences
that are the same data in a different case are still caught as
duplicates.
Cheers,
Paul