[Bioperl-l] Fishing redundant sequences in FASTA files [Right formatting]
Chris Fields
cjfields at illinois.edu
Tue Feb 15 21:01:49 UTC 2011
On Feb 15, 2011, at 2:47 PM, Dave Messina wrote:
>> SHA should work as well, didn't think of that (though I suppose the encoding
>> step for either would be rate-limiting?).
>>
>
> I haven't tested it, but I suspect that encoding either MD5 or SHA would be
> relatively quick compared to the sequence I/O, no?
Possibly. But one nice thing about clustering is that it allows for partial matches (which I think was the original criterion). I don't believe SHA/MD5 would work for that purpose, since a digest only matches when two sequences are identical.
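
To make the exact-match limitation concrete, here is a minimal Python sketch of the digest approach. It is not code from this thread: the function names (read_fasta, group_by_digest, redundant_groups) and the toy records are made up for illustration, and it assumes plain single-file FASTA input and that upper/lower case should be treated as the same residue.

```python
import hashlib
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_digest(records):
    """Map the MD5 hex digest of each uppercased sequence to its headers."""
    groups = defaultdict(list)
    for header, seq in records:
        digest = hashlib.md5(seq.upper().encode("ascii")).hexdigest()
        groups[digest].append(header)
    return groups


def redundant_groups(records):
    """Return the lists of headers that share an identical sequence."""
    return [ids for ids in group_by_digest(records).values() if len(ids) > 1]


# Toy input (hypothetical): 'a' and 'b' are identical once line wrapping
# and case are ignored; 'c' is one base shorter than 'a'.
example = """>a
ACGTACGT
>b
acgt
ACGT
>c
ACGTACG
""".splitlines()

print(redundant_groups(read_fasta(example)))  # [['a', 'b']]
```

Record 'c' differs from 'a' by a single missing base and still gets an unrelated digest, so it is not reported. That is the case where a clustering tool is needed instead.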
>> Will have to keep an eye on UCLUST, didn't know about that one.
>>
>
> As it happens, my current pipeline uses MCL but I'm testing UCLUST as a
> replacement since it's waaay faster. I'll let you know how the comparison
> turns out.
>
> And for that matter, if anyone listening has experience with UCLUST or
> CD-HIT or other clustering methods (ideally in the context of metagenomic
> next-gen sequence), please chime in with your thoughts.
As Malcolm pointed out, blastclust is also available with legacy BLAST, though I'm not sure it's available with BLAST+ (I didn't see anything obvious in BLAST+ for that purpose).
chris