[Bioperl-l] Bio::Taxonomy changes

Wed Jul 26 09:19:29 UTC 2006

Chris Fields wrote:
>
>> It seems like the main problem with Node right now is that it has
>> classification() and things like genus(). I propose pure Node method
>> solutions to answer the questions classification() and genus() were
>> implemented to answer, but in a better, cruft-free way.
>>
>> Bio::DB::Taxonomy::genbank anyone?
> 
> Ach...  You're compromising here;

No, I don't think so. Let me explain...
(another very long email, but with the same conclusion as above)

> 1) Switch out Bio::Species with Node or Taxonomy; relocate other  
> information temporarily (Bio::Species, get/sets in Seq object,  
> SimpleValue).  Leave Bio::Species in for the time being, but don't  
> bother making any additional changes to it.
[...]
> Hence Hilmar's suggestion to use a $seq->taxon() method to return a  
> Node/Taxonomy, and a $seq->species() would still return a  
> Bio::Species object.  It's redundant,

As I see it, the problem to be solved is this:

a) A node should just be a node, holding only information about itself 
(but this can include information on who its parent is, and methods 
relating to getting its parents/children as new objects - but the data 
of its parents/children must never be stored on itself).

b) Bio::Species isn't very good at its job; you can't ask reasonable 
taxonomic questions of it and get correct answers.

c) We need to transition Bio::Species to something better - something 
that lets us do the same job as Bio::Species, but do it better. An 
important aspect of 'better' is that we can switch from the taxonomic 
information in a genbank file or similar to the information in a 
taxonomic database if we want certain taxonomic questions answered 
correctly. But also, we should be able to answer all questions with a 
good chance of a correct answer even without database access/installation.
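
To make point a) concrete, here is a minimal sketch of a node that 
stores only its own data plus its parent's id, and asks a database 
handle whenever it needs a relative. It is written in Python purely as 
an illustration; the names Node, TaxDB, get_node and lineage are 
hypothetical and are not the BioPerl API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


class TaxDB:
    """Hypothetical lookup table: node id -> (name, rank, parent id)."""

    def __init__(self, rows: Dict[str, Tuple[str, str, Optional[str]]]):
        self._rows = rows

    def get_node(self, node_id: str) -> "Node":
        name, rank, parent_id = self._rows[node_id]
        # A fresh Node is built on every request; nothing is cached here.
        return Node(node_id, name, rank, parent_id, self)


@dataclass
class Node:
    """Data about one taxon only; relatives are looked up, never stored."""

    node_id: str
    name: str
    rank: str
    parent_id: Optional[str]
    db: TaxDB

    def parent(self) -> Optional["Node"]:
        if self.parent_id is None:
            return None
        return self.db.get_node(self.parent_id)

    def lineage(self) -> List["Node"]:
        """Ancestors of this node, root first, built by walking parent ids."""
        chain = []
        node = self.parent()
        while node is not None:
            chain.append(node)
            node = node.parent()
        chain.reverse()
        return chain
```

Because the node holds no copies of its ancestors, asking a different 
TaxDB for the same id is all it takes to change where the answers come 
from.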

There are a variety of possible solutions. How can we decide which is 
best? What would a good solution be?

The 'something better' we transition Bio::Species to will become the 
preferred (or at least de facto standard) way of dealing with taxonomic 
information in bioperl. This taxonomic module (or set of modules) must 
be able to model taxonomic information anywhere it is found - databases 
or genbank files or anything else. If it can't, it would be 
fundamentally flawed.

d) We can immediately discount any solution that involves storing some 
taxonomic information outside of the tax module. If we find ourselves 
putting lineage data in a genbank file in SimpleValue objects or 
similar, we can be pretty sure we've used a poor solution to the 
problem. That would be a compromise.

e) If the thing we transition Bio::Species to can't do everything 
Bio::Species did (doing it in a different and better way is fine of 
course), it's not suitable for transitioning to (this is why Node needed 
all the cruft added to it before it was a suitable candidate). If it 
/can/ do everything Bio::Species did, there would be no harm in immediately 
making Bio::Species inherit from the new tax module, reimplementing 
Bio::Species as necessary but making no API change. So any solution that 
would /require/ $seq->taxon() and $seq->species() wouldn't be a good 
one, and would be a compromise. But we do want to get rid of 
Bio::Species eventually, so I'm not saying we shouldn't have a 
$seq->taxon() or similar, only that either method would give you the 
same type of object with the same methods available 
($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species') 
&& $seq->species->isa('tax module'))).

I see 2 possible solutions to the problem, that is, 2 candidates for 
what 'tax module' should be:

1) Bio::Taxonomy or other similar class that is a container of multiple 
nodes. Naively this makes logical sense since one of the jobs 
Bio::Species has is to store a lineage, and a lineage is best 
represented as a set of Nodes. So let's have a single object with all 
our Nodes in it. Problems:

Bio::Taxonomy itself, as currently written, is fundamentally flawed. It 
requires that you know the ranks and order of ranks of all your input 
nodes before you input them. It requires that all ranks have unique 
names. It doesn't handle ranks of 'no rank'. You can't have more than 
one lineage in an instance because you can't have two nodes with the 
same rank. If you don't know the ranks of your nodes (as with genbank) there 
is no way to maintain the order of your lineage because there is no 
modelling of parent/child.
I had planned to re-write it such that the rank-centric implementation 
was removed and we had parent/child implementation instead. But then 
there is nothing to stop you adding nodes that are disconnected from the 
others, creating a broken mess.
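
The rank-centric problem can be shown with a toy example. This is a 
sketch in Python, not the Bio::Taxonomy code: store_by_rank and 
store_by_parent are hypothetical helpers, and the rank labels in LINEAGE 
are illustrative rather than taken from any database.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative lineage, root first. Two entries share the rank 'no rank'.
LINEAGE: List[Tuple[str, str]] = [
    ("cellular organisms", "no rank"),
    ("Eukaryota", "superkingdom"),
    ("Opisthokonta", "no rank"),
    ("Homo", "genus"),
    ("Homo sapiens", "species"),
]


def store_by_rank(lineage: List[Tuple[str, str]]) -> Dict[str, str]:
    """Rank-centric storage: one slot per rank name."""
    by_rank: Dict[str, str] = {}
    for name, rank in lineage:
        by_rank[rank] = name  # a later node with the same rank overwrites
    return by_rank


def store_by_parent(lineage: List[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Parent/child storage: each name maps to the name before it."""
    parent_of: Dict[str, Optional[str]] = {}
    previous: Optional[str] = None
    for name, _rank in lineage:
        parent_of[name] = previous
        previous = name
    return parent_of
```

With this input the rank-keyed version keeps only one of the two 'no 
rank' nodes, while the parent-keyed version keeps all five and 
preserves their order without needing to know any ranks.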

Bio::Taxonomy::Tree might have been a little more suitable because it 
implements Bio::Tree::TreeI, but sadly it is also rank-centric and 
actually requires input of both Bio::Species and Bio::Taxonomy objects 
to its most useful methods.

More important than issues with current implementations of 
node-container classes, such classes are unable to let us solve problem 
c) in a good way, and also leave us potentially storing in memory Node 
objects representing the same taxonomic node multiple times in different 
instances of the node-container. For problem c), if we were to switch 
from genbank nodes to a database, the solution would be to delete all the 
nodes in the container and then get them all again from the database. 
What if you didn't even have a lineage-related question? You've just 
retrieved tens of nodes from the database for no reason (and then stored 
them), when all you wanted was accurate information on the node you were 
interested in.

All in all, it's pretty horrible. Unsuitable implementations plus excess 
database retrieval plus massive waste of memory with duplicated nodes 
does not equal a good solution.

2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of 
methods binomial(), species(), genus(), sub_species(),
variant(), organelle(), classification() and show_all(). Except for 
organelle() which doesn't belong in taxonomy, all of these Bio::Species 
'questions' can still be answered by Node - just not in a single method 
call. I outlined how to answer them in the previous post. For backward 
compatibility make Bio::Species a Node and implement the suggested way 
of answering the questions the proper 'Node' way under those methods. 
Problems:

Well, those questions can't actually be answered by Node if the starting 
point was genbank data or manually created Nodes. The solution is clean 
and simple: Bio::DB::Taxonomy::genbank or perhaps better named 
Bio::DB::Taxonomy::list (because it makes a taxonomy database from an 
ordered list of names - I don't see anything inherently wrong or ugly 
with that). Then everything magically just works. We get all the power 
to ask all our questions that Node has already when working with the 
ncbi database, but we get it when working with genbank data. We suffer 
none of the problems of a node-container class. We can easily switch 
databases on the fly.
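
As a rough sketch of the idea behind a list-backed database (in Python, 
and not the BioPerl implementation): the ordered list of names from a 
genbank record becomes a tiny taxonomy in which each name's parent is 
the name before it. ListTaxonomy, TableTaxonomy, parent_of and 
lineage_of are all hypothetical names chosen for this example.

```python
from typing import Dict, List, Optional


class _ParentLookup:
    """Shared query methods; subclasses only differ in how _parent is filled."""

    _parent: Dict[str, Optional[str]]

    def parent_of(self, name: str) -> Optional[str]:
        return self._parent[name]

    def lineage_of(self, name: str) -> List[str]:
        """Ancestors of name, root first."""
        chain = []
        current = self._parent[name]
        while current is not None:
            chain.append(current)
            current = self._parent[current]
        chain.reverse()
        return chain


class ListTaxonomy(_ParentLookup):
    """Built from an ordered list of names, root first."""

    def __init__(self, names: List[str]):
        self._parent = {}
        previous: Optional[str] = None
        for name in names:
            self._parent[name] = previous
            previous = name


class TableTaxonomy(_ParentLookup):
    """Stand-in for a real taxonomy database: explicit child -> parent table."""

    def __init__(self, parent_table: Dict[str, Optional[str]]):
        self._parent = dict(parent_table)
```

Code that asks parent_of() or lineage_of() does not care which of the 
two it is talking to, which is what makes switching databases on the 
fly cheap.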

What's not to like?
