[Biopython-dev] interest module for sequence based clustering tools?

Mon Apr 14 17:00:32 UTC 2014

Peter,

It should be general enough. UCLUST was on my list of programs to support
but I back-burnered it because it's not free software/open source. But I'll
take a look and check out OrthoMCL as well.

I've since added tools that take an iterable of SeqRecords and return a
clustering. Writing to a temporary FASTA file, running the program, and
parsing the results are all taken care of.
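
As a rough illustration of that workflow (not the code in the seqcluster
branch), the sketch below uses plain (id, sequence) tuples in place of
SeqRecords and only the standard library. The names write_temp_fasta,
cluster_records, and group_identical are hypothetical; group_identical is a
stand-in for running an external program such as cd-hit and parsing its
output. With Biopython itself, the FASTA step would typically be
Bio.SeqIO.write(records, handle, "fasta").

```python
import os
import tempfile


def write_temp_fasta(records):
    """Write (id, sequence) pairs to a temporary FASTA file; return its path."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False)
    with handle:
        for seq_id, seq in records:
            handle.write(">%s\n%s\n" % (seq_id, seq))
    return handle.name


def cluster_records(records, run_tool):
    """Cluster records with run_tool and map member ids back to the records.

    run_tool is an assumed callable: it takes a FASTA path and returns an
    iterable of clusters, each cluster being a list of sequence ids.
    """
    records = list(records)
    by_id = dict((seq_id, (seq_id, seq)) for seq_id, seq in records)
    path = write_temp_fasta(records)
    try:
        # Build the result before the temporary file is removed.
        return [[by_id[i] for i in ids] for ids in run_tool(path)]
    finally:
        os.remove(path)


def group_identical(fasta_path):
    """Stand-in for a real clustering program: groups identical sequences."""
    groups = {}
    with open(fasta_path) as handle:
        lines = handle.read().split()
    # Lines alternate between ">id" headers and single-line sequences.
    for header, seq in zip(lines[0::2], lines[1::2]):
        groups.setdefault(seq, []).append(header[1:])
    return list(groups.values())


clusters = cluster_records(
    [("a", "ACGT"), ("b", "ACGT"), ("c", "TTTT")], group_identical
)
```

Passing the tool in as a callable keeps the temporary-file handling and the
id-to-record mapping independent of any particular clustering program.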

I'm also working on some documentation. I'm planning on having examples of
doing sequence deduplication, building motifs for each cluster, and
recalculating cluster medoids using multiple sequence alignments. Any other
suggestions are more than welcome.
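
For the medoid idea, here is a minimal sketch, assuming the members of a
cluster have already been aligned to equal length (for example by an external
multiple sequence alignment program). The medoid is taken to be the member
with the smallest total mismatch count against all members. The names hamming
and medoid are hypothetical, and the Hamming distance is an assumption; any
distance derived from the alignment could be swapped in.

```python
def hamming(a, b):
    """Count mismatching columns between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def medoid(aligned):
    """Return the id whose summed distance to all members is smallest.

    aligned maps sequence id -> aligned sequence string (gaps as "-").
    Ties go to the first id in the mapping's iteration order.
    """
    def total_distance(seq_id):
        return sum(hamming(aligned[seq_id], other) for other in aligned.values())

    return min(aligned, key=total_distance)


cluster = {"a": "ACGT-A", "b": "ACGTTA", "c": "ACCTTA"}
center = medoid(cluster)
```

In this toy cluster, "b" differs from each of the other two members in a
single column, so it is chosen as the medoid.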

-krish

On Mon, Apr 14, 2014 at 2:20 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Hi Krishna,
>
> This looks interesting, although it is not something I can remember
> doing myself with Python before. Do you think the framework would
> be general enough to cover other sequence clustering tools like
> UCLUST http://drive5.com/usearch/manual/uclust_algo.html
> or OrthoMCL?
>
> Peter
>
> On Tue, Apr 1, 2014 at 7:41 AM, Krishna Roskin <krishnaroskin at gmail.com>
> wrote:
> > Hey all,
> >
> > Long time fan, taking my first crack at contributing.
> >
> > I've built a basic module to run and parse the results of sequence-based
> > clustering tools such as DNACLUST and CD-HIT. I've written subclasses of
> > AbstractCommandline to run dnaclust and cd-hit. I've also written classes
> > to store the clusters and their members and loaders for the output formats
> > used by those programs.
> >
> > I'm posting here to gauge interest and get some feedback and maybe some
> > beta testers.
> >
> > My code is available at:
> >
> > https://github.com/krishnaroskin/biopython.git
> >
> > under the seqcluster branch. I've started writing some test code at:
> >
> > Tests/seqcluster/test_seqcluster.py
> >
> > that also serves as example code. I've pasted that at the end of this
> > message so people can get an idea of how it works without having to
> > check out the code.
> >
> > If there is interest, I'm planning on adding a seqclust.cluster function
> > that takes a list of SeqRecords and returns a clustering using one of the
> > supported tools. I envision that function being the main interface to this
> > module. I also want to write something that will map the cluster
> > membership (given by ids) back to collections of SeqRecords.
> >
> > Other to-dos:
> >
> > Test that all the flavors of CD-HIT work (there are many)
> > Add support for other sequence based clustering tools (suggestions?)
> > Documentation
> > Tutorial
> > Test code
> >
> > -krish
> >
> > #!/usr/bin/env python
> >
> > from __future__ import print_function
> >
> > import StringIO
> >
> > from Bio.seqcluster.applications import DNAClustCommandline
> > from Bio.seqcluster import DNAClustIterator
> >
> > from Bio.seqcluster.applications import CDHITCommandline
> > from Bio.seqcluster import CDHITClustIterator
> >
> > # Run dnaclust and parse the clustering it writes to stdout.
> > cmd = DNAClustCommandline(similarity=0.8, header=True, threads=2,
> >                           inputfile="test_sequences.fasta")
> > stdout, stderr = cmd()
> > clusters = DNAClustIterator(StringIO.StringIO(stdout))
> > for cluster in clusters:
> >     print(cluster.name)
> >     for member in cluster:
> >         # Mark the cluster representative with an asterisk.
> >         if member == cluster.representative:
> >             print("\t" + member.name + "*")
> >         else:
> >             print("\t" + member.name)
> >
> > print()     # blank line
> >
> > # Run cd-hit and parse the tmp.clstr file it writes next to the output.
> > cmd = CDHITCommandline(cutoff=0.8, threads=2,
> >                        inputfile="test_sequences.fasta", outputfile="tmp")
> > stdout, stderr = cmd()
> > clusters = CDHITClustIterator(open("tmp.clstr", "r"))
> > for cluster in clusters:
> >     print(cluster.name)
> >     for member in cluster:
> >         if member == cluster.representative:
> >             print("\t" + member.name + "*")
> >         else:
> >             print("\t" + member.name)
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
>