[Bioperl-l] Taxonomy hierarchy extraction

Tue Jun 19 14:25:26 UTC 2007

Hilmar Lapp wrote:
> So the real mistake was to write
> 
>   my $node = $db->get_Taxonomy_Node(-taxonid => '33090');
>   my @extant_children = grep { $_->is_Leaf } $node->get_all_Descendents;
> 
> instead of
> 
>   my $node = $db->get_Taxonomy_Node(-taxonid => '33090');
>   my @extant_children = grep { $_->is_Leaf } $db->get_all_Descendents($node);
> 
> I.e., the Bio::DB::Taxonomy object *will* (or is allowed to) ask the  
> database?

Yes, the database object methods use the database. I don't even think it 
makes sense to question that. What else would it do?

> If this is correct, can we highlight this in the documentation? It's  
> a small difference that everyone failed to spot.

The documentation for what? I've already clearly pointed out the gotcha 
in Bio::Taxon.

> Also, in my reading of Bio::Taxonomy::Taxon it won't use the database  
> either for ancestor(). Which would be consistent with its other methods.

Bio::Taxonomy::Taxon? It's deprecated. Bio::Taxon is what we're dealing 
with, and it /does/ use the db to get the ancestor, unless the ancestor 
is manually set (see below for explanation).

> I.e., the bottom line is don't use Node or Taxon objects for  
> hierarchy queries that you expect to use an underlying database, use  
> the Bio::DB::Taxonomy object instead. It makes sense, but is it true?

Almost. It happens to be true but ideally wouldn't be the case. The 
confusion and problems arise, I guess, because we have two ways to 
access/create hierarchies and both of them are built from the same 
building block (Bio::Taxon objects).

On the one hand we have Bio::DB::Taxonomy and on the other we have 
Bio::Tree::Tree.

Tree objects are easy: you have a Taxon object created in memory for 
each and every node in the tree. Each Taxon knows its ancestor and 
descendants by storing references to the relevant Taxon objects in the 
tree. You 'navigate' through the tree by grabbing a Taxon inside it and 
asking the Taxon itself for its ancestor or descendant.

This leaves us with the Taxon object having the methods ancestor() and 
each_Descendent(), which we'll expect to work in other circumstances.

Bio::DB::Taxonomy returns single Taxon objects from the database on 
request. Now we still expect our ancestor() and each_Descendent() 
methods to work, but if things were set up like Bio::Tree::Tree we'd end 
up pulling the entire database into memory because we'd have to create 
all the Taxon objects that are ancestors and descendants, recursively, 
every time we request a single Taxon (which is wasteful in the case of 
Bio::DB::Taxonomy::flatfile and slow/not allowed in the case of 
Bio::DB::Taxonomy::entrez).

The solution? We simply don't create the immediate ancestor or 
descendant Taxon objects of the requested Taxon, and instead implement 
the Taxon methods to ask the database to create them on demand, if they 
don't already exist. Well, that idea is fine (and necessary) for the 
ancestor method, but we run into problems with each_Descendent().
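
To make the on-demand idea concrete, here is a minimal sketch in plain 
Perl. ToyDB and ToyTaxon are hypothetical stand-ins invented for this 
illustration, not BioPerl classes, and caching the looked-up ancestor is 
an assumption of the sketch rather than a statement about how Bio::Taxon 
is implemented. The point is only that the database hands back a single 
object and the ancestor gets created when somebody asks for it, unless 
one was set manually.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy database: child id => parent id.
package ToyDB;

sub new {
    my ($class, %parent_of) = @_;
    return bless { parent_of => {%parent_of}, lookups => 0 }, $class;
}

# Returns a single taxon; no ancestors or descendants are created here.
sub get_taxon {
    my ($self, $id) = @_;
    return ToyTaxon->new(id => $id, db => $self);
}

# Creates the parent taxon on request and counts how often it is asked.
sub ancestor_of {
    my ($self, $taxon) = @_;
    $self->{lookups}++;
    my $parent_id = $self->{parent_of}{ $taxon->id };
    return defined $parent_id ? $self->get_taxon($parent_id) : undef;
}

sub lookups { $_[0]{lookups} }

# Hypothetical stand-in for a taxon object.
package ToyTaxon;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub id { $_[0]{id} }

sub ancestor {
    my $self = shift;
    if (@_) {
        # Setter: a manually set ancestor takes precedence over the database.
        $self->{ancestor} = shift;
    }
    elsif (!exists $self->{ancestor} && $self->{db}) {
        # Nothing set yet: ask the database once and remember the answer.
        $self->{ancestor} = $self->{db}->ancestor_of($self);
    }
    return $self->{ancestor};
}

package main;

# Taxon 3 has parent 2, which has parent 1 (the root).
my $db    = ToyDB->new(3 => 2, 2 => 1);
my $taxon = $db->get_taxon(3);

# Walk up the lineage; each step triggers one database lookup.
my @lineage;
for (my $t = $taxon; $t; $t = $t->ancestor) {
    push @lineage, $t->id;
}
print join(' -> ', @lineage), "\n";
```

Walking upwards like this only ever touches the nodes on one lineage, 
which is why the approach is harmless for ancestor().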

The problem arises when we create Bio::Tree::Tree objects from a Taxon 
we got from the database. Being able to do that is why Bio::Taxon is 
shared between them, as it is a very desirable thing to do: you can 
instantly create a lineage tree for a Taxon of interest and then use all 
the Bio::Tree::Tree methods on it. Unfortunately one of those methods is 
get_nodes() which is implemented using each_Descendent() and 
get_all_Descendents(). If each_Descendent() asked the database for the 
real answer, we'd end up pulling the entire database into the tree.
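
The difference between the two kinds of descendant query can be sketched 
in plain Perl as follows. Again, ToyDB and ToyTaxon and their method 
names are hypothetical and made up for the illustration; they only mimic 
the shape of the problem. The taxon-level methods see only what has been 
linked in memory (here, a single lineage), while the database-level 
method walks the whole stored subtree.

```perl
use strict;
use warnings;

# Hypothetical toy database: parent id => list of child ids.
package ToyDB;

sub new {
    my ($class, %children_of) = @_;
    return bless { children_of => {%children_of} }, $class;
}

sub get_taxon { ToyTaxon->new($_[1]) }

# Database-backed query: always consults the stored hierarchy.
sub each_descendent {
    my ($self, $taxon) = @_;
    my $ids = $self->{children_of}{ $taxon->id } || [];
    return map { $self->get_taxon($_) } @$ids;
}

# Recursive version: the whole subtree below $taxon, in pre-order.
sub get_all_descendents {
    my ($self, $taxon) = @_;
    return map { my $t = $_; ($t, $self->get_all_descendents($t)) }
        $self->each_descendent($taxon);
}

# Hypothetical toy taxon that knows only its in-memory children.
package ToyTaxon;

sub new {
    my ($class, $id) = @_;
    return bless { id => $id, kids => [] }, $class;
}

sub id             { $_[0]{id} }
sub add_descendent { push @{ $_[0]{kids} }, $_[1] }

# In-memory only: never touches the database.
sub each_descendent { @{ $_[0]{kids} } }

sub get_all_descendents {
    my ($self) = @_;
    return map { my $t = $_; ($t, $t->get_all_descendents) }
        $self->each_descendent;
}

package main;

# 1 is the root; 2 and 3 are its children; 4 and 5 are children of 2.
my $db = ToyDB->new(1 => [2, 3], 2 => [4, 5]);

# Lineage "tree" for taxon 4: only the chain 1 -> 2 -> 4 is linked.
my ($root, $mid, $leaf) = map { $db->get_taxon($_) } (1, 2, 4);
$root->add_descendent($mid);
$mid->add_descendent($leaf);

# Asking the taxon stays inside the lineage tree ...
my @in_tree = map { $_->id } $root->get_all_descendents;

# ... while asking the database returns every stored descendant.
my @in_db = map { $_->id } $db->get_all_descendents($root);

print "via taxon:    @in_tree\n";
print "via database: @in_db\n";
```

If the taxon-level each_descendent() delegated to the database instead, 
any traversal of the lineage tree would expand into the second result.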

So my implementation was to not ask the database and just warn people in 
the docs. Ideally it /would/ use the database, because that's what a 
user would expect. Can anyone see an alternate way around the problem?