[Bioperl-l] Taxonomy hierarchy extraction
Sendu Bala
bix at sendu.me.uk
Tue Jun 19 14:25:26 UTC 2007
Hilmar Lapp wrote:
> So the real mistake was to write
>
> my $node = $db->get_Taxonomy_Node(-taxonid => '33090');
> my @extant_children = grep { $_->is_Leaf } $node->get_all_Descendents;
>
> instead of
>
> my $node = $db->get_Taxonomy_Node(-taxonid => '33090');
> my @extant_children = grep { $_->is_Leaf } $db->get_all_Descendents($node);
>
> I.e., the Bio::DB::Taxonomy object *will* (or is allowed to) ask the
> database?
Yes, the database object methods use the database. I don't even think it
makes sense to question that. What else would it do?
> If this is correct, can we highlight this in the documentation? It's
> a small difference that everyone failed to spot.
The documentation for what? I've already clearly pointed out the gotcha
in Bio::Taxon.
> Also, in my reading of Bio::Taxonomy::Taxon it won't use the database
> either for ancestor(). Which would be consistent with its other methods.
Bio::Taxonomy::Taxon? It's deprecated. Bio::Taxon is what we're dealing
with, and it /does/ use the db to get the ancestor, unless the ancestor
is manually set (see below for explanation).
> I.e., the bottom line is don't use Node or Taxon objects for
> hierarchy queries that you expect to use an underlying database, use
> the Bio::DB::Taxonomy object instead. It makes sense, but is it true?
Almost. It happens to be true but ideally wouldn't be the case. The
confusion and problems arise, I guess, because we have two ways to
access/create hierarchies and both of them are built from the same
building block (Bio::Taxon objects).
On the one hand we have Bio::DB::Taxonomy and on the other we have
Bio::Tree::Tree.
Tree objects are easy: you have a Taxon object created in memory for
each and every node in the tree. Each Taxon knows its ancestor and
descendants by storing references to the relevant Taxon objects in the
tree. You 'navigate' through the tree by grabbing a Taxon inside it and
asking the Taxon itself for its ancestor or descendant.
This leaves us with the Taxon object having the methods ancestor() and
each_Descendent(), which we'll expect to work in other circumstances.
Bio::DB::Taxonomy returns single Taxon objects from the database on
request. We still expect ancestor() and each_Descendent() to work on
them, but if things were set up as in Bio::Tree::Tree, every request for
a single Taxon would have to create all of its ancestors and
descendants, recursively, and so pull the entire database into memory.
That is wasteful in the case of Bio::DB::Taxonomy::flatfile and slow (or
simply not allowed) in the case of Bio::DB::Taxonomy::entrez.
The solution? We simply don't create the immediate ancestor or
descendant Taxon objects of the requested Taxon, and instead implement
the Taxon methods to ask the database to create them on demand, if they
don't already exist. Well, that idea is fine (and necessary) for the
ancestor method, but we run into problems with each_Descendent().
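To make the on-demand idea concrete, here is a minimal sketch in plain Perl. It is not the actual Bio::Taxon code: ToyDB, ToyTaxon, parent_id() and the lookup counter are made up for illustration, and only the method names ancestor() and each_Descendent() mirror the real ones.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy database: child id => parent id.
package ToyDB;
sub new {
    my ($class, %parent_of) = @_;
    my $self = { parent_of => {%parent_of}, lookups => 0 };
    return bless $self, $class;
}
sub get_taxon {
    my ($self, $id) = @_;
    $self->{lookups}++;    # count how often the "database" is asked
    return ToyTaxon->new(id => $id, db => $self);
}
sub parent_id {
    my ($self, $id) = @_;
    return $self->{parent_of}{$id};
}

# Hypothetical stand-in for a Taxon handed out by the database.
package ToyTaxon;
sub new {
    my ($class, %args) = @_;
    my $self = { %args, children => [] };
    return bless $self, $class;
}
sub id { $_[0]{id} }

# ancestor(): use a manually set or cached ancestor if there is one,
# otherwise ask the database to create it on demand.
sub ancestor {
    my ($self, $new) = @_;
    $self->{ancestor} = $new if $new;
    unless (exists $self->{ancestor}) {
        my $pid = $self->{db}->parent_id($self->{id});
        $self->{ancestor} = defined $pid ? $self->{db}->get_taxon($pid) : undef;
    }
    return $self->{ancestor};
}

# each_Descendent(): only reports Taxa already attached in memory;
# it never goes back to the database.
sub each_Descendent { @{ $_[0]{children} } }

package main;

my $db    = ToyDB->new(sapiens => 'Homo', Homo => 'Hominidae');
my $taxon = $db->get_taxon('sapiens');
my @kids  = $taxon->each_Descendent;

print $taxon->ancestor->id, "\n";    # prints "Homo", fetched on demand
print scalar(@kids), "\n";           # prints 0, the database is not asked
```

The asymmetry is the point: a Taxon has one ancestor, so fetching it lazily is cheap, whereas descendants fan out.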
The problem arises when we create Bio::Tree::Tree objects from a Taxon
we got from the database. Being able to do that is why Bio::Taxon is
shared between them, as it is a very desirable thing to do: you can
instantly create a lineage tree for a Taxon of interest and then use all
the Bio::Tree::Tree methods on it. Unfortunately, one of those methods
is get_nodes(), which is implemented using each_Descendent() and
get_all_Descendents(). If each_Descendent() asked the database for the
real answer, we'd end up pulling the entire database into the tree.
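A toy illustration of that blow-up, in plain Perl and without any BioPerl classes. The tables %children_of and %in_memory and the helper names are hypothetical; collect_nodes() only stands in for a get_nodes()-style traversal and is not the real implementation.

```perl
use strict;
use warnings;

# Hypothetical toy "database": parent id => list of child ids.
my %children_of = (
    root => [qw(A B)],
    A    => [qw(A1 A2)],
    B    => [qw(B1)],
);
my $db_queries = 0;

# A db-backed each_Descendent(): every call is a database query.
sub db_each_descendent {
    my ($id) = @_;
    $db_queries++;
    return @{ $children_of{$id} || [] };
}

# A memory-only each_Descendent(): sees just what was attached by hand,
# here the single lineage root -> A -> A1.
my %in_memory = (root => ['A'], A => ['A1']);
sub mem_each_descendent {
    my ($id) = @_;
    return @{ $in_memory{$id} || [] };
}

# get_nodes()-style traversal: collect a node and everything below it,
# using whichever each_Descendent() implementation is passed in.
sub collect_nodes {
    my ($id, $each_descendent) = @_;
    my @below = map { collect_nodes($_, $each_descendent) } $each_descendent->($id);
    return ($id, @below);
}

my @lineage = collect_nodes('root', \&mem_each_descendent);
my @whole   = collect_nodes('root', \&db_each_descendent);

print scalar(@lineage), " nodes from memory\n";
print scalar(@whole), " nodes and $db_queries queries via the database\n";
```

With the memory-only version the traversal stays inside the three-node lineage. With the db-backed version it visits all six nodes and issues one query per node, which for a real taxonomy means the whole subtree below the root.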
So my implementation was to not ask the database and just warn people in
the docs. Ideally it /would/ use the database, because that's what a
user would expect. Can anyone see an alternate way around the problem?