[Bioperl-l] Bio::*Taxonomy* changes
Sendu Bala
bix at sendu.me.uk
Wed Jul 26 09:19:29 UTC 2006
Chris Fields wrote:
>
>> It seems like the main problem with Node right now is that it has
>> classification() and things like genus(). I propose pure Node method
>> solutions to answer the questions classification() and genus() were
>> implemented to answer, but in a better, cruft-free way.
>>
>> Bio::DB::Taxonomy::genbank anyone?
>
> Ach... You're compromising here;
No, I don't think so. Let me explain...
(another very long email, but with the same conclusion as above)
> 1) Switch out Bio::Species with Node or Taxonomy; relocate other
> information temporarily (Bio::Species, get/sets in Seq object,
> SimpleValue). Leave Bio::Species in for the time being, but don't
> bother making any additional changes to it.
[...]
> Hence Hilmar's suggestion to use a $seq->taxon() method to return a
> Node/Taxonomy, and a $seq->species() would still return a
> Bio::Species object. It's redundant,
As I see it, the problem to be solved is this:
a) A node should just be a node, holding only information about itself
(but this can include information on who its parent is, and methods
relating to getting its parents/children as new objects - but the data
of its parents/children must never be stored on itself).
b) Bio::Species isn't very good at its job; you can't ask reasonable
taxonomic questions of it and get correct answers.
c) We need to transition Bio::Species to something better - something
that lets us do the same job as Bio::Species, but do it better. An
important aspect of 'better' is that we can switch from the taxonomic
information in a genbank file or similar to the information in a
taxonomic database if we want certain taxonomic questions answered
correctly. But also, we should be able to answer all questions with a
good chance of a correct answer even without database access/installation.
There are a variety of possible solutions. How can we decide which is
best? What would a good solution be?
The 'something better' we transition Bio::Species to will become the
preferred (or at least de facto standard) way of dealing with taxonomic
information in bioperl. This taxonomic module (or set of modules) must
be able to model taxonomic information anywhere it is found - databases
or genbank files or anything else. If it can't, it would be
fundamentally flawed.
d) We can immediately discount any solution that involves storing some
taxonomic information outside of the tax module. If we find ourselves
putting lineage data in a genbank file in SimpleValue objects or
similar, we can be pretty sure we've used a poor solution to the
problem. That would be a compromise.
e) If the thing we transition Bio::Species to can't do everything
Bio::Species did (doing it in a different and better way is fine of
course), it's not suitable for transitioning to (this is why Node needed
all the cruft added to it before it was a suitable candidate). If it
/can/ do everything Bio::Species did, there would be no harm in immediately
making Bio::Species inherit from the new tax module, reimplementing
Bio::Species as necessary but making no API change. So any solution that
would /require/ $seq->taxon() and $seq->species() wouldn't be a good
one, and would be a compromise. But we do want to get rid of
Bio::Species eventually, so I'm not saying we shouldn't have a
$seq->taxon() or similar, only that either method would give you the
same type of object with the same methods available
($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species')
&& $seq->species->isa('tax module')).
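
To make the isa() relationship in e) concrete, here is a minimal sketch.
My::TaxModule and My::Species are hypothetical stand-ins for 'tax module'
and Bio::Species; they are not real BioPerl classes, and the accessor
names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the 'tax module' (not a real BioPerl class).
package My::TaxModule;
sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name} }, $class;
}
sub name { $_[0]->{name} }

# Hypothetical stand-in for Bio::Species: it keeps a legacy-style
# accessor but inherits everything else from the tax module.
package My::Species;
our @ISA = ('My::TaxModule');
sub species { $_[0]->name }

package main;

my $taxon   = My::TaxModule->new(name => 'Homo sapiens');
my $species = My::Species->new(name => 'Homo sapiens');

# Both objects satisfy the 'tax module' interface; only one is a Species.
print "taxon isa tax module\n" if $taxon->isa('My::TaxModule');
print "species isa both\n"
    if $species->isa('My::Species') && $species->isa('My::TaxModule');
```

With this arrangement, code written against the old class keeps working,
while new code can rely on the tax module methods from either object.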
I see 2 possible solutions to the problem. What should 'tax module' be?
1) Bio::Taxonomy or other similar class that is a container of multiple
nodes. Naively this makes logical sense since one of the jobs
Bio::Species has is to store a lineage, and a lineage is best
represented as a set of Nodes. So let's have a single object with all
our Nodes in it. Problems:
Bio::Taxonomy itself, as currently written, is fundamentally flawed. It
requires that you know the ranks and order of ranks of all your input
nodes before you input them. It requires that all ranks have unique
names. It doesn't handle ranks of 'no rank'. You can't have more than
one lineage in an instance because you can't have two nodes with the
same rank. If you don't know the ranks of your nodes (e.g. with genbank
data), there is no way to maintain the order of your lineage because
there is no modelling of parent/child.
I had planned to re-write it such that the rank-centric implementation
was removed and we had parent/child implementation instead. But then
there is nothing to stop you adding nodes that are disconnected from the
others, creating a broken mess.
Bio::Taxonomy::Tree might have been a little more suitable because it
implements Bio::Tree::TreeI, but sadly it is also rank-centric and
actually requires input of both Bio::Species and Bio::Taxonomy objects
to its most useful methods.
More important than issues with current implementations of
node-container classes, such classes are unable to let us solve problem
c) in a good way, and also leave us potentially storing in memory Node
objects representing the same taxonomic node multiple times in different
instances of the node-container. For problem c), if we were to switch
from genbank nodes to a database, the only option is to delete all the
nodes in the container and then get them all again from the database.
What if you didn't even have a lineage-related question? You've just
retrieved tens of nodes from the database for no reason (and then stored
them), when all you wanted was accurate information on the node you were
interested in.
All in all, it's pretty horrible. Unsuitable implementations plus excess
database retrieval plus massive waste of memory with duplicated nodes
does not equal a good solution.
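
The rank-centric problem described above can be shown in a few lines.
This is only a sketch: the lineage is abbreviated and the ranks attached
to each name are assumed for illustration, not taken from a real record.
It contrasts a container keyed by rank with a plain parent/child map.

```perl
use strict;
use warnings;

# Abbreviated lineage, root first. Names and ranks are illustrative only.
my @lineage = (
    [ 'Eukaryota',    'superkingdom' ],
    [ 'Opisthokonta', 'no rank' ],
    [ 'Metazoa',      'kingdom' ],
    [ 'Bilateria',    'no rank' ],
);

# Rank-keyed container: one slot per rank, so the second 'no rank'
# entry silently overwrites the first, and the order is lost.
my %by_rank;
$by_rank{ $_->[1] } = $_->[0] for @lineage;

# Parent/child model: each name records only who its parent is.
my %parent_of;
for my $i (1 .. $#lineage) {
    $parent_of{ $lineage[$i][0] } = $lineage[ $i - 1 ][0];
}

# Walk from the leaf back to the root to recover the ordered lineage.
my @path;
for (my $n = 'Bilateria'; defined $n; $n = $parent_of{$n}) {
    unshift @path, $n;
}

printf "%d nodes in, %d kept by rank\n",
    scalar @lineage, scalar keys %by_rank;
print join(' > ', @path), "\n";
```

The rank-keyed hash keeps three of the four nodes, while the parent/child
map recovers the full lineage in order without knowing any ranks.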
2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of
methods binomial(), species(), genus(), sub_species(),
variant(), organelle(), classification() and show_all(). Except for
organelle() which doesn't belong in taxonomy, all of these Bio::Species
'questions' can still be answered by Node - just not in a single method
call. I outlined how to answer them in the previous post. For backward
compatibility make Bio::Species a Node and implement the suggested way
of answering the questions the proper 'Node' way under those methods.
Problems:
Well, those questions can't actually be answered by Node if the starting
point was genbank data or manually created Nodes. The solution is clean
and simple: Bio::DB::Taxonomy::genbank or perhaps better named
Bio::DB::Taxonomy::list (because it makes a taxonomy database from an
ordered list of names - I don't see anything inherently wrong or ugly
with that). Then everything magically just works. We get all the power
to ask the questions that Node can already answer when working with the
NCBI database, but we get it when working with genbank data. We suffer
none of the problems of a node-container class. We can easily switch
databases on the fly.
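
As a rough sketch of the list-based database idea: My::ListDB and
My::Node below are hypothetical packages, not the real
Bio::DB::Taxonomy or Bio::Taxonomy::Node API, and the lineages are
illustrative. The node stores only its own name and a database handle;
parents are always fetched from the database as new objects.

```perl
use strict;
use warnings;

# Hypothetical database built from an ordered list of names, root first.
package My::ListDB;
sub new {
    my ($class, @names) = @_;
    my %parent;
    $parent{ $names[$_] } = $names[ $_ - 1 ] for 1 .. $#names;
    my %known = map { $_ => 1 } @names;
    return bless { known => \%known, parent => \%parent }, $class;
}
sub get_node {
    my ($self, $name) = @_;
    return undef unless $self->{known}{$name};
    return My::Node->new(name => $name, db => $self);
}
sub parent_name { $_[0]->{parent}{ $_[1] } }

# Hypothetical node: holds its own name and a db handle, nothing else.
package My::Node;
sub new { my ($class, %args) = @_; return bless { %args }, $class }
sub name { $_[0]->{name} }
sub db {
    my $self = shift;
    $self->{db} = shift if @_;    # setter form lets us switch databases
    return $self->{db};
}
sub parent {
    my $self  = shift;
    my $pname = $self->db->parent_name($self->name);
    return defined $pname ? $self->db->get_node($pname) : undef;
}
sub lineage {
    my $self = shift;
    my @names;
    for (my $n = $self; $n; $n = $n->parent) { unshift @names, $n->name }
    return @names;
}

package main;

my $db   = My::ListDB->new(qw(Eukaryota Metazoa Chordata), 'Homo sapiens');
my $node = $db->get_node('Homo sapiens');
print join('; ', $node->lineage), "\n";

# Switch databases on the fly: same node, different source of lineage.
my $other = My::ListDB->new('cellular organisms', 'Eukaryota', 'Homo sapiens');
$node->db($other);
print join('; ', $node->lineage), "\n";
```

No lineage data lives on the node itself, so switching the handle is
enough to have every later question answered from the other source.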
What's not to like?