[Bioperl-l] Removing duplicate FASTA record one-liners
Paul Gordon
gordonp@cbr.nrc.ca
Thu, 19 Dec 2002 11:24:58 -0700
>
>>> Option #1:
>>> create a hash with all the sequences as you read them, and check
>>> for duplicates by seeing whether that hash element already exists
>>>
>> Which in the spirit of ridiculous one-liners would be:
>>
>> perl -ne 'BEGIN{$/=">";$"=";"}/(.*?)\n(.+?)\s*>?$/s && push @{$h{$2}},$1;END{for(keys%h){print ">@{$h{$_}}\n$_\n"}}'
>>
>> Which will remove redundant entries AND concatenate their description
>> lines :-)
>
>
> does not work properly, because the parse of the ">header\nsequence"
> record:
>
> /(.*?)\n(.+?)\s*>?$/s
>
> leaves the '\n's in the captured sequence. Thus, not only would the
> sequences have to be identical, their line breaks would have to be
> identical as well.
>
> Lincoln's solution is not only easier to read, it actually works.
Hee hee. I didn't think this bit of fun (i.e. not intended for
production use, hence the word "ridiculous" in my original mail) would
spark such a long discussion! For brevity I didn't include the 'tr'
statement required to deal with differently formatted files. The real
point of my mail, though, is that to make this complete, both Lincoln's
code and my own need an lc() or uc() (your choice) so that sequences
that are the same data in a different case are still caught as
duplicates.
Cheers,
Paul