[Biopython] Follow-up about python/biopython code for submitting multiple jobs to clustalw

Thu Jan 23 07:39:12 UTC 2014

Hi all,

I couldn't get a response about my struggles which I asked few days past, I
presume it was either a poorly submitted question or my approach with what
I want to do is totally out of touch with mainstream bioinformatics. The
thing is I am a newbie to both python programming and bioinformatics but I
believe there are people here who can help, so I will try again with more
background.

The overall goal with what I want to achieve is to perform selection
analyses on multiple species with codeml in PAML. For this the inputs
should be both the sequence alignments and tree files. I already have
sequence file (produced by pal2nal) but I still need a corresponding tree
file.

So what I am challenged with is the fact that my nucleotide alignment file
contain cds of four species at many loci (it is kind of whole genome data)
so I will have to submit the job to a tree producing program per each
alignment - I can use clustalw or Phylip.

Looking at biopython facility, thankfully there is biopython wrapper for
clustalw which I attempted to use for trees, but the fact that my alignment
file contains multiple alignments, I cannot use the code the way it is (the
straight code assumes the file contains a single alignment). So I reasoned
that I can couple this clustalw wrapper with a Dictionary facility to
output the desired results as so:

from Bio import SeqIO
> from Bio.Align.Applications import ClustalwCommandline
>
> def get_ids(record):
>     """"Given a SeqRecord, return the common number shared among sequence
> descriptions.
>             e.g. ">ENSBTAT00000009085_cow or ENSBTAT00000009085_goat or
> ENSBTAT00000009085_sheep
>             " -> "ENSBTAT00000009085"
>             """
>     parts = record.description[:18]
>     return parts
>
> myseq_dict = SeqIO.to_dict(SeqIO.parse("/home/edson/ungulate/infile.fa",
> "fasta"), key_function=get_ids)
> #print myseq_dict.keys()
> cline = ClustalwCommandline("clustalw2", infile="myseq_dict")
> stdout, stderr = clustalw_cline()
>

It turned out this code is a result of my naive (very naive) reasoning and
it is obvious why it cannot work. But I am just putting it here to give you
a clue of what I want to do. I'm sure there is a convenient way to do what
I want to do and I hope this forum will help. I apologize it is a long
email (english is not my first language, at times I'm being wordy to make
myself clear). Any resource will be appreciated.

Thanks.

Edson B. Ishengoma
PhD-Candidate
*School of Life Sciences and Engineering
Nelson Mandela African Institute of Science and Technology
Nelson Mandela Road
P. O. Box 447, Arusha
Tanzania (255)
*
*ishengomae at nm-aist.ac.tz  *ebarongo82 at yahoo.co.uk
*

<http://www.nm-aist.ac.tz/>Mobile: +255 762 348 037, +255 714 789 360,
  Website: www.nm-aist.ac.tz
Skype: edson.ishengoma

*
*
**