[Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)

Gabriel Valiente valiente at lsi.upc.edu
Fri Jul 28 16:35:20 UTC 2006


>> It is a general algorithm I devised that takes a set of paths and
>> builds up a tree. The input paths are full lineages coming from
>> Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why
>> I said I don't know exactly where it would belong, it looks to me
>> more like a standalone script than a Bio::Taxonomy or Bio::Tree  
>> method.
>>
>> Gabriel
>
> Agreed.  You could submit the script as an example here if it is  
> short, or
> via Bugzilla as an enhancement request:
>
> http://bugzilla.open-bio.org/
>
> It could be added to the scripts\ or examples\ directory in bioperl- 
> core.

Here it is. Please check it and include for instance as  
taxonomy2tree.PLS in the scripts/tree or scripts/taxonomy directory.  
Disclaimer: I'm also publishing part of this code in a conference paper.

The script is already fully functional but anyway, I have a couple of  
improvements in mind. The minor one is provision for cmdline input.  
How would you like to input an array of names?

The other one is to remove internal node labels and contract  
elementary paths, for instance reducing the tree:

(((((((((((((((((((((((((((("Pongo pygmaeus")Pongo,(("Gorilla  
gorilla")Gorilla,("Pan troglodytes")Pan,("Homo sapiens")Homo)"Homo/ 
Pan/Gorilla group")Hominidae)Hominoidea)Catarrhini)Simiiformes) 
Primates)Euarchontoglires)Eutheria)Theria)Mammalia)Amniota)Tetrapoda) 
Sarcopterygii)Euteleostomi)Teleostomi)"Gnathostomata <vertebrate>") 
Vertebrata)"Craniata <chordata>")Chordata)Deuterostomia)Coelomata) 
Bilateria)Eumetazoa)Metazoa)"Fungi/Metazoa group")Eukaryota)"cellular  
organisms")root;

to the tree:

("Pongo pygmaeus",("Gorilla gorilla","Pan troglodytes","Homo sapiens"));

It is certainly easy to remove all internal node labels. On the other  
hand, I've been working on contraction of elementary paths for quite  
a while, but always got stuck with internals of the Bio::Tree methods  
to remove nodes. Thus, so far the only working code I have consists  
of removing elementary branches while making a deep copy of the tree,  
which certainly is not quite elegant...

Thanks a lot,

Gabriel


#!/usr/bin/perl -w
# Author:  Gabriel Valiente <valiente at lsi.upc.edu>
# Purpose: Bio::DB::Taxonomy -> Bio::Tree::Tree

use strict;
use Bio::DB::Taxonomy;
use Bio::TreeIO;

my $nodesfile = "nodes.dmp";
my $namesfile = "names.dmp";
my $db = new Bio::DB::Taxonomy(-source => 'flatfile',
                                -directory => "./db/",
                                -nodesfile => $nodesfile,
                                -namesfile => $namesfile);

# the input to the script is an array of species names

my @species = ('Orangutan',
'Gorilla',
'Chimpanzee',
'Human');

my $root = new Bio::Tree::Node(-id => "root");
my $tree = new Bio::Tree::Tree(-root => $root);

# the full lineages of the species are merged into a tree

for my $name (@species) {
   my $ncbi_id = $db->get_taxonid($name);
   if ($ncbi_id) {
     my $node = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
     my @lineage = get_lineage_nodes($node);
     shift @lineage; # discard root
     push @lineage, $node;
     merge_path($root, \@lineage);
   } else {
     warn "no NCBI Taxonomy node for species ",$name,"\n";
   }
}

# the tree is output in Newick format

my $output = new Bio::TreeIO(-format => 'newick');
$output->write_tree($tree);

# the actual merging of full lineages is performed by a recursive method

sub merge_path {
   my $root = shift;
   my $path = shift;
   my @path = @{$path};
   if (@path) {
     my $top = shift @path;
     my @children = grep { $_->id eq $top->node_name } $root- 
 >each_Descendent;
     if (@children) {
       # $root has a $child with id eq $top name
       my $child = shift @children;
       merge_path($child,\@path);
     } else {
       # add $top and @path below $root
       my $node = $root;
       unshift @path, $top;
       while (@path) {
         my $top = shift @path;
         my $name = $top->node_name;
         my $child = new Bio::Tree::Node(-id => "$name");
         $node->add_Descendent($child);
         $node = $child;
       }
     }
   }
}

# the full lineage of a species is recovered by traversing the taxonomy

sub get_lineage_nodes{
   my $node = shift;
   my @lineage;
   while ($node->node_name ne "root") {
     $node = $node->get_Parent_Node;
     unshift @lineage, $node;
   }
   return @lineage;
}

=head1 NAME

taxonomy2tree - builds a taxonomic tree based on the full lineages of  
a set of species names

=head1 DESCRIPTION

This script requires that the bioperl-run pkg be also installed.

Providing the nodes.dmp and names.dmp files from the NCBI Taxonomy
dump (see Bio::DB::Taxonomy::flatfile for more info) is only necessary
on the first time running.  This will create the local indexes and may
take quite a long time.  However once created, these indexes will
allow fast access for species to taxon id OR taxon id to species name
lookups.

=cut






More information about the Bioperl-l mailing list