[Bioperl-l] Removing duplicate FASTA record one-liners
Lincoln Stein
lstein@cshl.org
Thu, 19 Dec 2002 21:36:27 -0500
Yah. Don't know how I missed that one.
Lincoln
On Thursday 19 December 2002 01:24 pm, Paul Gordon wrote:
> >>> Option #1:
> >>> create a hash with all the sequences as you read them, and check
> >>> for duplicates by seeing whether that hash element already exists
> >>
> >> Which in the spirit of ridiculous one-liners would be:
> >>
> >> perl -ne 'BEGIN{$/=">";$"=";"}/(.*?)\n(.+?)\s*>?$/s && push
> >> @{$h{$2}},$1;END{for(keys%h){print ">@{$h{$_}}\n$_\n"}}'
> >>
> >> Which will remove redundant entries AND concatenate their description
> >> lines :-)
> >
> > does not work properly, because the parse of the >header\nsequence:
> >
> > /(.*?)\n(.+?)\s*>?$/s
> >
> > leaves in the '\n's. Thus, not only would the sequences have to be
> > identical,
> > their line breaks would have to be identical as well.
> >
> > Lincoln's solution is not only easier to read, it actually works.
>
> Hee hee. I didn't think this bit of fun (i.e. not intended for
> production use, hence the use of the word "ridiculous" in my original
> mail) code would spark such a long discussion! For brevity I didn't
> include the 'tr' statement required to deal with differently formatted
> files. The real point of my mail though is that if you want to make
> this complete, both Lincoln's code and my own are missing an lc() or uc()
> (your choice) to make sure the sequences aren't treated as distinct when
> they are the same data in a different case.
>
> Cheers,
> Paul
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
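Pulling the thread's fixes together — strip embedded line breaks before hashing
(so wrapping doesn't matter), case-fold so identical sequences in different case
collapse, and concatenate the description lines of duplicates — here is a sketch
of the same idea written out in Python rather than as a Perl one-liner. This is
my own illustration of the algorithm under discussion, not code from the thread:

```python
def dedupe_fasta(text):
    """Collapse FASTA records whose sequences are identical.

    Keys each record on its sequence with line breaks removed and case
    folded (the two fixes discussed above); description lines of
    duplicates are joined with ';', matching the one-liner's intent.
    """
    records = {}  # normalized sequence -> list of description lines
    for chunk in text.split(">"):
        if not chunk.strip():
            continue  # skip the empty piece before the first '>'
        header, _, body = chunk.partition("\n")
        # Drop all whitespace (including '\n') and normalize case,
        # so only the residues themselves are compared.
        seq = "".join(body.split()).upper()
        records.setdefault(seq, []).append(header.strip())
    return "".join(
        ">%s\n%s\n" % (";".join(headers), seq)
        for seq, headers in records.items()
    )

# Three records, same sequence in different wrapping and case:
fasta = ">a desc1\nACGT\nacgt\n>b desc2\nACGTA\nCGT\n>c desc3\nacgtacgt\n"
print(dedupe_fasta(fasta))
# prints one merged record: >a desc1;b desc2;c desc3 / ACGTACGT
```

Without the `"".join(body.split())` and `.upper()` normalization, the three
records above would survive as three "distinct" sequences — which is exactly
the failure mode the thread describes.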
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
========================================================================