[Bioperl-l] Taxonomy hierarchy extraction

Tue Jun 19 16:14:38 UTC 2007

Sorry, I was accidentally looking at an older branch.

Reading through the Taxon module, though, leaves me more confused
than I am comfortable with.

Here's what I understand of your description of the problem:

- We would like nodes returned from Bio::DB::Taxonomy to use the  
database for all hierarchical queries.

- We would like nodes used in a Bio::Tree::Tree not to use the  
database for any hierarchical query.

What I understand we have is:

- Taxon node objects that have a db_handle set will use the database
for ancestor(), unless the ancestor has been set manually (?), but
not for each_Descendent().

- Taxon node objects that don't have a db_handle set won't use a  
database but will function normally otherwise.

- each_Descendent() has to avoid the database in order to prevent
Bio::Tree::Tree methods from pulling the entire tree into memory.

If this is correct (I'm not sure it is), it sounds like we want to  
temporarily divorce taxonomy nodes from their database capabilities  
while they are being queried in a tree context?

I'm still trying to understand - if I create a Bio::Tree::Tree from a  
single node, will the tree automatically contain all nodes along the  
lineage of ancestors up to the root? So querying a database would be
acceptable for extracting this lineage, but not for querying
descendants?

It sounds to me like nodes that get added to a tree need to be
stripped of their database capabilities. This could be achieved by
creating a wrapper class that delegates all non-hierarchical methods
to the wrapped Taxon object and overrides all hierarchical queries so
that they do not use a database. I'm not sure I fully understand the
problem yet, but the inconsistent behavior is sure to throw people
off track.
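
To make the wrapper idea concrete, here is a minimal sketch in plain
Perl. It does not use BioPerl: Mock::Taxon and TreeOnly::Taxon are
hypothetical names made up for illustration, and the %db hash merely
stands in for a real database handle. Only the method names
ancestor(), each_Descendent() and add_Descendent() mirror the ones
discussed in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a database-backed taxon node (NOT Bio::Taxon).
package Mock::Taxon;

sub new {
    my ($class, %args) = @_;            # keys: id, name, db
    return bless {%args}, $class;
}
sub id              { $_[0]{id} }
sub scientific_name { $_[0]{name} }

# Hierarchical query that consults the "database" on every call.
sub ancestor {
    my $self = shift;
    my $db = $self->{db} or return;
    $db->{lookups}++;
    return $db->{parent_of}{ $self->{id} };
}

# Hypothetical wrapper: hierarchy comes from memory only, everything
# else is delegated to the wrapped node.
package TreeOnly::Taxon;
use Scalar::Util qw(weaken);

our $AUTOLOAD;

sub new {
    my ($class, $taxon) = @_;
    return bless { taxon => $taxon, ancestor => undef, children => [] }, $class;
}

sub ancestor        { $_[0]{ancestor} }
sub each_Descendent { @{ $_[0]{children} } }

sub add_Descendent {
    my ($self, $child) = @_;
    push @{ $self->{children} }, $child;
    $child->{ancestor} = $self;
    weaken($child->{ancestor});         # avoid a parent <-> child reference cycle
    return scalar @{ $self->{children} };
}

sub DESTROY { }                         # keep DESTROY out of AUTOLOAD

# Delegate every method not defined above to the wrapped node.
sub AUTOLOAD {
    my $self = shift;
    (my $method = $AUTOLOAD) =~ s/.*:://;
    return $self->{taxon}->$method(@_);
}

package main;

my %db = (lookups => 0, parent_of => {});
my $parent_node = Mock::Taxon->new(id => 33090, name => 'parent', db => \%db);
my $child_node  = Mock::Taxon->new(id => 1,     name => 'child',  db => \%db);
$db{parent_of}{1} = $parent_node;

my $root = TreeOnly::Taxon->new($parent_node);
my $leaf = TreeOnly::Taxon->new($child_node);
$root->add_Descendent($leaf);

print $leaf->scientific_name, "\n";            # delegated; prints "child"
print $leaf->ancestor->scientific_name, "\n";  # from memory; prints "parent"
print "db lookups: $db{lookups}\n";            # prints "db lookups: 0"
```

In this sketch the wrapped node still reaches the database when asked
directly, while the wrapper never does. Whether that split is the
right one for Bio::Taxon is exactly the open question.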

	-hilmar

On Jun 19, 2007, at 10:25 AM, Sendu Bala wrote:

> Hilmar Lapp wrote:
>> So the real mistake was to write
>>   my $node = $db->get_Taxonomy_Node(-taxonid => '33090');
>>   my @extant_children = grep { $_->is_Leaf } $node->get_all_Descendents;
>> instead of
>>   my $node = $db->get_Taxonomy_Node(-taxonid => '33090');
>>   my @extant_children = grep { $_->is_Leaf } $db->get_all_Descendents ($node);
>> I.e., the Bio::DB::Taxonomy object *will* (or is allowed to) ask
>> the database?
>
> Yes, the database object methods use the database. I don't even  
> think it makes sense to question that. What else would it do?
>
>
>> If this is correct, can we highlight this in the documentation?
>> It's a small difference that everyone failed to spot.
>
> The documentation for what? I've already clearly pointed out the  
> gotcha in Bio::Taxon.
>
>
>> Also, in my reading of Bio::Taxonomy::Taxon it won't use the
>> database either for ancestor(). Which would be consistent with
>> its other methods.
>
> Bio::Taxonomy::Taxon? It's deprecated. Bio::Taxon is what we're
> dealing with, and it /does/ use the db to get the ancestor, unless  
> the ancestor is manually set (see below for explanation).
>
>
>> I.e., the bottom line is don't use Node or Taxon objects for
>> hierarchy queries that you expect to use an underlying database,
>> use the Bio::DB::Taxonomy object instead. It makes sense, but is
>> it true?
>
> Almost. It happens to be true but ideally wouldn't be the case. The  
> confusion and problems arise, I guess, because we have two ways to  
> access/create hierarchies and both of them are built from the same  
> building block (Bio::Taxon objects).
>
> On the one hand we have Bio::DB::Taxonomy and on the other we have
> Bio::Tree::Tree.
>
> Tree objects are easy: you have a Taxon object created in memory  
> for each and every node in the tree. Each Taxon knows its ancestor  
> and descendants by storing references to the relevant Taxon objects  
> in the tree. You 'navigate' through the tree by grabbing a Taxon  
> inside it and asking the Taxon itself for its ancestor or descendant.
>
> This leaves us with the Taxon object having the methods ancestor()  
> and each_Descendent(), which we'll expect to work in other  
> circumstances.
>
> Bio::DB::Taxonomy returns single Taxon objects from the database on  
> request. Now we still expect our ancestor() and each_Descendent()  
> methods to work, but if things were set up like Bio::Tree::Tree  
> we'd end up pulling the entire database into memory because we'd  
> have to create all the Taxon objects that are ancestors and  
> descendants, recursively, every time we request a single Taxon  
> (which is wasteful in the case of Bio::DB::Taxonomy::flatfile and  
> slow/not allowed in the case of Bio::DB::Taxonomy::entrez).
>
> The solution? We simply don't create the immediate ancestor or  
> descendant Taxon objects of the requested Taxon, and instead  
> implement the Taxon methods to ask the database to create them on  
> demand, if they don't already exist. Well, that idea is fine (and  
> necessary) for the ancestor method, but we run into problems with  
> each_Descendent().
>
> The problem arises when we create Bio::Tree::Tree objects from a  
> Taxon we got from the database. Being able to do that is why  
> Bio::Taxon is shared between them, as it is a very desirable thing  
> to do: you can instantly create a lineage tree for a Taxon of  
> interest and then use all the Bio::Tree::Tree methods on it.  
> Unfortunately one of those methods is get_nodes() which is  
> implemented using each_Descendent() and get_all_Descendents(). If  
> each_Descendent() asked the database for the real answer, we'd end  
> up pulling the entire database into the tree.
>
> So my implementation was to not ask the database and just warn  
> people in the docs. Ideally it /would/ use the database, because  
> that's what a user would expect. Can anyone see an alternate way  
> around the problem?

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================