[Biopython-dev] interest module for sequence based clustering tools?

Mon Apr 14 09:20:25 UTC 2014

Hi Krishna,

This looks interesting, although it is not something I can remember
doing myself with Python before. Do you think the framework would
be general enough to cover other sequence clustering tools like
UCLUST http://drive5.com/usearch/manual/uclust_algo.html
or OrthoMCL?

Peter

On Tue, Apr 1, 2014 at 7:41 AM, Krishna Roskin <krishnaroskin at gmail.com> wrote:
> Hey all,
>
> Long time fan, taking my first crack at contributing.
>
> I've built a basic module to run and parse the result of sequence based
> clustering tools such as DNACLUST and CD-HIT. I've written subclasses of
> AbstractCommandline to run dnaclust and cd-hit. I've also written classes
> to store the clusters and their members and loaders for the output formats
> used by those programs.
>
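
For anyone following along who has not seen CD-HIT output before, here is a rough sketch of reading the .clstr layout in plain Python. This is not the loader from the seqcluster branch: parse_clstr and the sample data are made up for illustration, and I am assuming the usual layout of a ">Cluster N" header followed by numbered member lines, with "*" marking the representative.

```python
import re
from io import StringIO

# Matches a member line such as "0\t120nt, >seq1... *"
_MEMBER = re.compile(
    r"^\d+\s+\d+(?:aa|nt),\s+>(?P<name>.+?)\.\.\.\s+(?P<rest>.*)$"
)


def parse_clstr(handle):
    """Yield (cluster_name, representative, member_names) tuples.

    Hypothetical helper for illustration only; it is not part of
    Biopython or of the seqcluster branch.
    """
    name, rep, members = None, None, []
    for line in handle:
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            # A new cluster header: emit the previous cluster, if any.
            if name is not None:
                yield name, rep, members
            name, rep, members = line[1:], None, []
        else:
            match = _MEMBER.match(line)
            if match is None:
                raise ValueError("Unrecognised .clstr line: %r" % line)
            members.append(match.group("name"))
            # The representative sequence is flagged with a lone "*".
            if match.group("rest") == "*":
                rep = match.group("name")
    if name is not None:
        yield name, rep, members


# Invented sample data in the assumed .clstr layout.
example = StringIO(
    ">Cluster 0\n"
    "0\t120nt, >seq1... *\n"
    "1\t118nt, >seq2... at +/97.46%\n"
    ">Cluster 1\n"
    "0\t95nt, >seq3... *\n"
)
clusters = list(parse_clstr(example))
```

Yielding one cluster at a time keeps memory use flat on large files, which is presumably why an iterator-style loader is attractive here.
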
> I'm posting here to gauge interest and get some feedback and maybe some beta
> testers.
>
> My code is available at:
>
> https://github.com/krishnaroskin/biopython.git
>
> under the seqcluster branch. I've started writing some test code at:
>
> Tests/seqcluster/test_seqcluster.py
>
> that also serves as example code. I've pasted that at the end of this
> message so people can get an idea of how it works without having to
> check out the code.
>
> If there is interest, I'm planning on adding a seqclust.cluster function
> that takes a list of SeqRecords and returns a clustering using one of the
> supported tools. I envision that function being the main interface to this
> module. I also want to write something that will map the cluster
> membership (given by ids) back to collections of SeqRecords.
>
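
The mapping from ids back to records could be quite small. Below is a minimal sketch under some assumptions: group_records is a hypothetical name, the clustering is represented as a plain dict of cluster name to member ids, and Record is a stand-in for SeqRecord so the snippet runs without Biopython (only the .id attribute is used).

```python
from collections import namedtuple

# Stand-in for Bio.SeqRecord.SeqRecord so the sketch runs without
# Biopython installed; only the .id attribute is relied upon.
Record = namedtuple("Record", ["id", "seq"])


def group_records(clusters, records):
    """Turn {cluster_name: [ids]} into {cluster_name: [records]}.

    Hypothetical helper for illustration. Raises ValueError on duplicate
    record ids and KeyError if a cluster names an unknown id.
    """
    by_id = {}
    for record in records:
        if record.id in by_id:
            raise ValueError("Duplicate record id: %s" % record.id)
        by_id[record.id] = record
    return {
        name: [by_id[member_id] for member_id in member_ids]
        for name, member_ids in clusters.items()
    }


# Invented example data.
records = [
    Record("seq1", "ACGT"),
    Record("seq2", "ACGA"),
    Record("seq3", "TTTT"),
]
clusters = {"Cluster 0": ["seq1", "seq2"], "Cluster 1": ["seq3"]}
grouped = group_records(clusters, records)
```

One thing to watch: the lookup is an exact match on id, so it would need adjusting if a tool shortens names in its output (I believe CD-HIT does this by default unless told otherwise with its -d option, but please check).
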
> Other to-dos:
>
> Test that all the flavors of CD-HIT work (there are many)
> Add support for other sequence based clustering tools (suggestions?)
> Documentation
> Tutorial
> Test code
>
> -krish
>
> #!/usr/bin/env python
>
> from __future__ import print_function
>
> import StringIO
>
> from Bio.seqcluster.applications import DNAClustCommandline
> from Bio.seqcluster import DNAClustIterator
>
> from Bio.seqcluster.applications import CDHITCommandline
> from Bio.seqcluster import CDHITClustIterator
>
> cmd = DNAClustCommandline(similarity=0.8, header=True, threads=2,
>                           inputfile="test_sequences.fasta")
> stdout, stderr = cmd()
> clusters = DNAClustIterator(StringIO.StringIO(stdout))
> for cluster in clusters:
>     print(cluster.name)
>     for member in cluster:
>         if member == cluster.representative:
>             print("\t" + member.name + "*")
>         else:
>             print("\t" + member.name)
>
> print()     # blank line
>
> cmd = CDHITCommandline(cutoff=0.8, threads=2,
>                        inputfile="test_sequences.fasta", outputfile="tmp")
> stdout, stderr = cmd()
> clusters = CDHITClustIterator(open("tmp.clstr", "r"))
> for cluster in clusters:
>     print(cluster.name)
>     for member in cluster:
>         if member == cluster.representative:
>             print("\t" + member.name + "*")
>         else:
>             print("\t" + member.name)