[Bioperl-l] Bio::*Taxonomy* changes
Sendu Bala
bix at sendu.me.uk
Tue Jul 18 22:50:37 UTC 2006
Chris Fields wrote:
> ...
>> [regarding changes to Bio::Taxonomy::Node]
>>
>> Actually, I'm really strongly leaning toward getting rid of the
>> following methods and new() options (and giving up entirely on being
>> able to keep 'sapiens' somewhere):
>>
>> -organelle, organelle()
>> -division, division()
>> -sub_species, sub_species()
>> -variant, variant()
>> species(), validate_species_name()
>> genus()
>> binomial()
>
> Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to
> have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes
> to Node will affect Bio::Species to some degree.
I see from the original postings that Node was intended to be like
Species, but I don't think it makes the slightest bit of sense. A
/single/ Node need only (must only!) represent the information for a
single node in the taxonomy. Or else what do these objects mean? What is
the object model? It's bad bad bad for it to be sensible one way (when
you're making your own taxonomy by making your own nodes) and
nonsensical another (when we stuff in methods so that Bio::Species is
happy). The way Node is written right now, and what you're suggesting,
is that we stuff the entire Taxonomy into the Node. Well, except that
you don't even have methods for every taxonomic level - there is genus()
but no subphylum(). I can't emphasise strongly enough how insane all
this is.
The correct thing for Bio::Species to interact with is Bio::Taxonomy.
Bio::Taxonomy is a collection of Nodes and has the sort of methods that
Bio::Species would need to delegate its current functionality.
I'm quite willing to do a proper overhaul here so everything makes
sense. You either make your own nodes and add these to a Taxonomy or use
a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy
lets you discover the classification of any node it contains.
Bio::Species could implement a method like genus() by:
  my $node = $taxonomy->get_node('genus') || return;
  return $node->scientific_name;
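A minimal, self-contained sketch of that delegation follows. The Toy::* packages are hypothetical stand-ins written for illustration; they are not the real Bio::Species, Bio::Taxonomy or Bio::Taxonomy::Node, and the only assumption carried over is a get_node($rank) lookup plus a scientific_name() accessor.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a node: it only knows its own data.
package Toy::Node;
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub rank            { return $_[0]{rank} }
sub scientific_name { return $_[0]{scientific_name} }

# Hypothetical stand-in for a taxonomy: a collection of nodes.
package Toy::Taxonomy;
sub new {
    my ($class) = @_;
    return bless { nodes => [] }, $class;
}
sub add_node {
    my ($self, $node) = @_;
    push @{ $self->{nodes} }, $node;
    return $self;
}
# Return the first node with the given rank, or nothing if there is none.
sub get_node {
    my ($self, $rank) = @_;
    for my $node (@{ $self->{nodes} }) {
        return $node if $node->rank eq $rank;
    }
    return;
}

# Hypothetical stand-in for a species object that delegates to a taxonomy.
package Toy::Species;
sub new {
    my ($class, $taxonomy) = @_;
    return bless { taxonomy => $taxonomy }, $class;
}
sub genus {
    my ($self) = @_;
    my $node = $self->{taxonomy}->get_node('genus') || return;
    return $node->scientific_name;
}
sub binomial {
    my ($self) = @_;
    my $node = $self->{taxonomy}->get_node('species') || return;
    return $node->scientific_name;
}

package main;

my $taxonomy = Toy::Taxonomy->new;
$taxonomy->add_node(Toy::Node->new(rank => 'genus',   scientific_name => 'Homo'));
$taxonomy->add_node(Toy::Node->new(rank => 'species', scientific_name => 'Homo sapiens'));

my $species = Toy::Species->new($taxonomy);
print $species->genus,    "\n";
print $species->binomial, "\n";
```

The species object holds no taxonomic data of its own here; every answer comes from whichever node in the taxonomy has the requested rank.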
Bio::Taxonomy isn't perfect, but I can certainly get it to do its job.
I'd probably make it rank-name and order independent for starters.
Bio::Taxonomy::Node needs to be reduced right down to just hold data
about the node it represents, and possibly its parent node id (or other
way of getting to its parent). So now I'm proposing dropping the
classification() method from Node as well. It's simply not necessary;
Bio::Taxonomy should give you that information.
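As a rough sketch of that split, assuming hypothetical Sketch::* packages (not the real BioPerl classes) and made-up ids: each node stores only its own data and a parent id, and the taxonomy derives the classification by walking parent ids. Nothing in the walk depends on rank names or on a fixed rank order, and a 'no rank' node is kept like any other. The three-node tree is deliberately truncated for illustration.

```perl
use strict;
use warnings;

# Hypothetical node: its own data plus the id of its parent, nothing else.
package Sketch::Node;
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub id              { return $_[0]{id} }
sub parent_id       { return $_[0]{parent_id} }
sub rank            { return $_[0]{rank} }
sub scientific_name { return $_[0]{scientific_name} }

# Hypothetical taxonomy: nodes indexed by id.
package Sketch::Taxonomy;
sub new {
    my ($class) = @_;
    return bless { by_id => {} }, $class;
}
sub add_node {
    my ($self, $node) = @_;
    $self->{by_id}{ $node->id } = $node;
    return $self;
}
# Nodes from the given id up to the root; stops at an unknown id or a cycle.
sub lineage {
    my ($self, $id) = @_;
    my (@lineage, %seen);
    while (defined $id && !$seen{$id}++) {
        my $node = $self->{by_id}{$id} or last;
        push @lineage, $node;
        $id = $node->parent_id;
    }
    return @lineage;
}
# Scientific names along the lineage, most specific first.
sub classification {
    my ($self, $id) = @_;
    return map { $_->scientific_name } $self->lineage($id);
}

package main;

my $tax = Sketch::Taxonomy->new;
# The ids 1..3 are arbitrary and the tree is heavily truncated.
$tax->add_node(Sketch::Node->new(id => 1, parent_id => undef,
    rank => 'no rank', scientific_name => 'cellular organisms'));
$tax->add_node(Sketch::Node->new(id => 2, parent_id => 1,
    rank => 'genus', scientific_name => 'Homo'));
$tax->add_node(Sketch::Node->new(id => 3, parent_id => 2,
    rank => 'species', scientific_name => 'Homo sapiens'));

print join('; ', $tax->classification(3)), "\n";
# prints "Homo sapiens; Homo; cellular organisms"
```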
Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment from
its docs (I'm not sure what some of its methods are really supposed to
do), but its intent seems to be building a Taxonomy. If it were used that
way, Node might not even need any methods for getting its parent or child
nodes; the Factory or Taxonomy might be able to deal with that.
In short, I'm proposing a major change to Bio::Taxonomy::Node (make it
just a node), and minor changes to (& implementation of) Bio::Taxonomy
and Bio::Taxonomy::FactoryI such that they actually get used to do their
jobs.
> That's also why I thought binomial() could stick around; if you have both
> the genus() and species() you could grab both using binomial(), building in
> special cases or error handling in case genus() or species() or both return
> undef.
binomial() would belong in (and is present in) Bio::Taxonomy. But in any
case, it's not needed there either; if you want the binomial you just
ask for the scientific_name of the species node in your Taxonomy, since
this now contains the actual scientific name == binomial.
binomial() in Bio::Taxonomy could be reimplemented as:
  my $node = $self->get_node('species') || return;
  return $node->scientific_name;
>> Currently, flatfile and entrez ignore nodes with a rank of 'no rank'
>> when they build the classification array. I had no intention of changing
>> this behaviour.
>
> If you ignore nodes with 'no rank' there will be major problems when
> retrieving certain TaxID's from protein/nucleotide sequences.
This is only for the classification array, which is meaningless anyway
(there only for file-format compatibility). If you want the real
information you ask your Bio::Taxonomy (which asks each of its nodes).
This is the whole point of having Bio::Taxonomy in the first place.
It gives you great flexibility to do whatever you want to do.
>>> <TaxId>1760</TaxId>
>>> <ScientificName>Actinobacteria (class)</ScientificName>
>>> <Rank>class</Rank>
>> Ugh. I guess my proposal to remove <> bits via flatfile extends to
>> removing () bits via entrez. We don't need unique names; we can use
>> object_id() when uniqueness matters.
>
> The XML parsing in Taxonomy::entrez will take care of the <tags> and retains
> the character data in between.
You misunderstood. I meant the <> bits I discussed at the very start of
this thread, that flatfile gives you. Here I'm referring to getting rid
of ' (class)' as well.
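One way such clean-up might look, as a sketch only: clean_name() is a hypothetical helper, not an existing BioPerl function, and the exact form of the flatfile "<>" suffix is an assumption here (the example input for it is illustrative). The parenthesised suffix is removed only when it repeats the node's own rank, so other parentheses in a name are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the disambiguating suffixes from a name.
sub clean_name {
    my ($name, $rank) = @_;
    # Trailing " <...>" disambiguator (assumed flatfile form).
    $name =~ s/\s*<[^<>]*>\s*$//;
    # Trailing " (rank)" that merely repeats the node's rank, e.g. " (class)".
    $name =~ s/\s*\(\Q$rank\E\)\s*$// if defined $rank;
    return $name;
}

print clean_name('Actinobacteria (class)', 'class'), "\n";
print clean_name('Bacillus <bacterium>',   'genus'), "\n";
print clean_name('Homo sapiens',           'species'), "\n";
```

Uniqueness would then come from object_id() rather than from the name itself.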
> Any way we go about it here (keeping certain methods and tossing others,
> changing the data returned, etc), it looks like there will be API issues
> down the road which will directly affect anyone using tax data. That
> affects bioperl-db directly as well as any other bioperl-based DB's which
> rely on tax data. So we need to tread a bit carefully when making major
> changes to make sure that they work for bioperl-db and anywhere else that
> may require it.
Does anything make serious use of the current Bio::Taxonomy code? Or are
they using Bio::Species?