[Bioperl-l] Bio::*Taxonomy* changes

Chris Fields cjfields at uiuc.edu
Wed Jul 26 12:16:29 UTC 2006


> ...
>
> I see 2 possible solutions to the problem. What should 'tax module'  
> be?:
>
> 1) Bio::Taxonomy or other similar class that is a container of multiple
> nodes. Naively this makes logical sense since one of the jobs
> Bio::Species has is to store a lineage, and a lineage is best
> represented as a set of Nodes. So let's have a single object with all
> our Nodes in it. Problems:
>
> Bio::Taxonomy itself, as currently written, is fundamentally flawed. It
> requires that you know the ranks and order of ranks of all your input
> nodes before you input them. It requires that all ranks have unique
> names. It doesn't handle ranks of 'no rank'. You can't have more than
> one lineage in an instance because you can't have two nodes with the
> same rank. If you don't know the ranks of your nodes (ie. genbank) there
> is no way to maintain the order of your lineage because there is no
> modelling of parent/child.
> I had planned to re-write it such that the rank-centric implementation
> was removed and we had parent/child implementation instead. But then
> there is nothing to stop you adding nodes that are disconnected from the
> others, creating a broken mess.
>
> Bio::Taxonomy::Tree might have been a little more suitable because it
> implements Bio::Tree::TreeI, but sadly it is also rank-centric and
> actually requires input of both Bio::Species and Bio::Taxonomy objects
> to its most useful methods.
>
> More important than issues with current implementations of
> node-container classes, such classes are unable to let us solve problem
> c) in a good way, and also leave us potentially storing in memory Node
> objects representing the same taxonomic node multiple times in different
> instances of the node-container. For problem c) if we were to switch
> from genbank nodes to database the solution is to delete all the nodes
> in the container and then get them all again from the database. What if
> you didn't even have a lineage-related question? You've just retrieved
> 10s of nodes from the database for no reason (and then store them), when
> all you wanted was accurate information on the node you were interested
> in.
>
> All in all, it's pretty horrible. Unsuitable implementations plus excess
> database retrieval plus massive waste of memory with duplicated nodes
> does not equal a good solution.
>
>
> 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of
> methods binomial(), species(), genus(), sub_species(),
> variant(), organelle(), classification() and show_all(). Except for
> organelle() which doesn't belong in taxonomy, all of these Bio::Species
> 'questions' can still be answered by Node - just not in a single method
> call. I outlined how to answer them in the previous post. For backward
> compatibility make Bio::Species a Node and implement the suggested way
> of answering the questions the proper 'Node' way under those methods.
> Problems:
>
> Well, those questions can't actually be answered by Node if the starting
> point was genbank data or manually created Nodes. The solution is clean
> and simple: Bio::DB::Taxonomy::genbank or perhaps better named
> Bio::DB::Taxonomy::list (because it makes a taxonomy database from an
> ordered list of names - I don't see anything inherently wrong or ugly
> with that). Then everything magically just works. We get all the power
> to ask all our questions that Node has already when working with the
> ncbi database, but we get it when working with genbank data. We suffer
> none of the problems of a node-container class. We can easily switch
> databases on the fly.
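
To make the 'list' idea quoted above concrete, here is a rough sketch of
a lookup table built from an ordered list of names (most general first,
as in a GenBank lineage line). Sketch::ListDB and all of its method
names are invented for illustration and are not an existing BioPerl
class; it assumes every name in the list is unique and models nothing
beyond parent/child.

```perl
use strict;
use warnings;

package Sketch::ListDB;    # hypothetical name, not a real BioPerl module

# Build parent/child links from names ordered general -> specific.
sub new {
    my ($class, @names) = @_;
    my (%known, %parent, %child);
    for my $name (@names) {
        # Assumption: names are unique, otherwise the links would be ambiguous.
        die "duplicate name '$name'\n" if $known{$name}++;
    }
    for my $i (1 .. $#names) {
        $parent{ $names[$i] }      = $names[ $i - 1 ];
        $child{ $names[ $i - 1 ] } = $names[$i];
    }
    return bless { known => \%known, parent => \%parent, child => \%child },
        $class;
}

sub has_node {
    my ($self, $name) = @_;
    return exists $self->{known}{$name};
}

# Parent name, or undef for the first (root) name.
sub parent_of {
    my ($self, $name) = @_;
    return $self->{parent}{$name};
}

# Child name, or undef for the last (leaf) name.
sub child_of {
    my ($self, $name) = @_;
    return $self->{child}{$name};
}

# Names from the root down to $name; empty list for unknown names.
sub lineage_of {
    my ($self, $name) = @_;
    return () unless $self->has_node($name);
    my @path = ($name);
    while (defined(my $p = $self->parent_of($path[0]))) {
        unshift @path, $p;
    }
    return @path;
}

package main;

my $db = Sketch::ListDB->new('Eukaryota', 'Metazoa', 'Chordata', 'Mammalia');
print $db->parent_of('Chordata'), "\n";
print join('; ', $db->lineage_of('Chordata')), "\n";
```

Because the order of the list is the only information available, the
sketch can answer parent, child and lineage questions but knows nothing
about ranks.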

Whether Bio::Taxonomy ends up as that 'broken mess' is up to the user.
You could make it more stringent (i.e. only allow connected nodes,
starting with a single initiating node and building from there), though
I don't think that's necessary, as most people would probably use some
sort of factory to generate a taxonomy (a warning might be
appropriate). You would also have to watch out for potential circular
structures. Have it do what you want. I believe the original intent
of Taxonomy was to allow building a full-fledged taxonomic structure,
so it should stay that way.
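
For illustration, the more stringent variant (a single initiating node,
only connected additions, and a guard against circular structures) could
look roughly like the sketch below. Sketch::Taxonomy and its methods are
invented for this example and are not the Bio::Taxonomy API; nodes are
plain names here rather than Bio::Taxonomy::Node objects.

```perl
use strict;
use warnings;

package Sketch::Taxonomy;    # hypothetical container, not Bio::Taxonomy

sub new {
    my ($class) = @_;
    # parent_of maps a node name to its parent's name (undef for the root)
    return bless { parent_of => {}, root => undef }, $class;
}

# The first node added becomes the root; every later node must name a
# parent that is already in the container, so nothing is disconnected.
sub add_node {
    my ($self, $name, $parent) = @_;
    die "node '$name' already present\n"
        if exists $self->{parent_of}{$name};
    if (!defined $self->{root}) {
        die "the first node must be the root (no parent)\n"
            if defined $parent;
        $self->{root} = $name;
    }
    else {
        die "node '$name' is disconnected: parent missing or unknown\n"
            unless defined $parent && exists $self->{parent_of}{$parent};
    }
    $self->{parent_of}{$name} = $parent;
    return $self;
}

# Re-parent an existing node, refusing any move that would create a cycle.
sub set_parent {
    my ($self, $name, $new_parent) = @_;
    die "unknown node\n"
        unless defined $name && exists $self->{parent_of}{$name};
    die "unknown parent\n"
        unless defined $new_parent && exists $self->{parent_of}{$new_parent};
    die "cannot re-parent the root\n" if $name eq $self->{root};
    # Walk up from the proposed parent; meeting $name on the way means
    # $name is one of its ancestors, so the move would close a loop.
    for (my $n = $new_parent; defined $n; $n = $self->{parent_of}{$n}) {
        die "moving '$name' under '$new_parent' would create a cycle\n"
            if $n eq $name;
    }
    $self->{parent_of}{$name} = $new_parent;
    return $self;
}

# Names from the root down to $name.
sub lineage {
    my ($self, $name) = @_;
    die "unknown node\n"
        unless defined $name && exists $self->{parent_of}{$name};
    my @path;
    for (my $n = $name; defined $n; $n = $self->{parent_of}{$n}) {
        unshift @path, $n;
    }
    return @path;
}

package main;

my $tax = Sketch::Taxonomy->new;
$tax->add_node('Eukaryota');
$tax->add_node('Metazoa', 'Eukaryota');
$tax->add_node('Homo sapiens', 'Metazoa');
print join('; ', $tax->lineage('Homo sapiens')), "\n";

# A node whose parent is not in the container is refused.
eval { $tax->add_node('Orphan', 'Nowhere') };
print "rejected: $@" if $@;
```

Swapping the die calls for warnings would give the looser behaviour,
where building a sensible structure stays the user's responsibility.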

Sendu, keep in mind that this comes down to how you want to implement
it. We're giving you the freedom to do what you want with
Bio::Taxonomy. Of course, if we think you're off track we'll reel you
back in, but you seem to be heading the right way. The only
contentious issue here is that horrible lineage line in the GenBank
file. We should have a way to rebuild it exactly as it was in the
original file (i.e. not rebuild it from scratch with DB lookups by
default). However, you should also have the option to rebuild it from
lookups (i.e. correctly), which you could do with a Taxonomy.

Note this Bio::Taxonomy method:

        classify

         Title   : classify
         Usage   : @obj[][0-1] = taxonomy->classify($species);
         Function: return a ranked classification
         Returns : @obj of taxa and ranks as word pairs separated by "@"
         Args    : Bio::Species object

As Bio::Species will be deprecated, you can use that method in a
dual, sneaky way: 1) directly store the lineage information, and 2)
return the real one (via DB lookups) if needed (e.g. if some flag is
set). And, if a Bio::Species argument is passed, do what the
docs state (catch it early on with an if block and return within
it). Bio::Species, as used within genbank.pm, doesn't use
Bio::Taxonomy in any way. I don't know if you even need to retain
the method's original purpose here; you might be able to get away with
changing the fundamental way it works altogether. That's up
to you.
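
A minimal sketch of that dual behaviour, under loud assumptions:
Sketch::Lineage, Sketch::Species, the use_lookup flag and the lookup
code ref are all made up for illustration, the "database" is a plain
hash, and none of this is the current Bio::Taxonomy implementation.

```perl
use strict;
use warnings;
use Scalar::Util ();

package Sketch::Species;    # stand-in for a legacy Bio::Species-like object

sub new {
    my ($class, @names) = @_;
    return bless { names => [@names] }, $class;
}

sub classification { my ($self) = @_; return @{ $self->{names} } }

package Sketch::Lineage;    # hypothetical holder of one record's lineage

sub new {
    my ($class, %args) = @_;
    return bless {
        stored     => $args{stored} || [],    # lineage exactly as in the file
        lookup     => $args{lookup},          # code ref standing in for a DB
        use_lookup => $args{use_lookup} ? 1 : 0,
    }, $class;
}

sub classify {
    my ($self, $species) = @_;

    # Legacy path: a species object was passed, so deal with it first
    # and return from inside the if block.
    if (Scalar::Util::blessed($species)
        && $species->isa('Sketch::Species'))
    {
        return $species->classification;
    }

    # Flag set and a lookup available: resolve the lineage afresh,
    # keyed on the most specific stored name.
    if ($self->{use_lookup} && $self->{lookup}) {
        return $self->{lookup}->($self->{stored}[-1]);
    }

    # Default: the lineage as it appeared in the original record.
    return @{ $self->{stored} };
}

package main;

# Invented data standing in for a taxonomy database.
my %fake_db = (
    'Homo sapiens' => [ 'Eukaryota', 'Metazoa', 'Chordata', 'Homo sapiens' ],
);
my $lookup = sub {
    my ($name) = @_;
    return defined $name && $fake_db{$name} ? @{ $fake_db{$name} } : ();
};

my $lin = Sketch::Lineage->new(
    stored => [ 'Eukaryota', 'Homo sapiens' ],
    lookup => $lookup,
);
print join('; ', $lin->classify), "\n";    # lineage as stored in the record
$lin->{use_lookup} = 1;
print join('; ', $lin->classify), "\n";    # lineage resolved via the lookup
```

Keeping the stored lineage as the default means a record can be written
back out as it came in, with lookups only happening when asked for.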

my 2c

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign





