[Bioperl-l] Bio::*Taxonomy* changes

Sendu Bala bix at sendu.me.uk
Thu Jul 20 13:35:43 UTC 2006


Sendu Bala wrote:
> node 2 has name 'Bacteria <bacteria>' and rank 'superkingdom'
> node 1386 has name 'Bacillus <bacterium>' and rank 'genus'
> node 7776 has name 'Gnathostomata <vertebrate>' and rank 'superclass'
> etc.
> 
> For me the bits in <> are inappropriate and shouldn't be there.
> [...]
> If there are no objections I'll strip the <> bits. I also plan to make 
> $node->name('scientific', 'sapiens'); set and get the node name, and 
> have flatfile and entrez store all common names with 
> $obj->name('common', 'human', 'man');.

I'll describe all the changes I've now made and if no-one complains I'll 
commit. (I've also made these notes into bug 2047 for easier reference 
in the future.)

Bio::DB::Taxonomy::flatfile
---------------------------

# Bug-fixes
Removed invalid requirement that all species nodes have at least 7 
named-rank parents.

The names->id solution used by get_taxonid() only stored the last id 
associated with a name. However, the name used wasn't necessarily unique, 
so multiple ids could match. The names->id solution now remembers all 
ids that match a name.
API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids() 
and it returns an array of ids in list context. For backward 
compatibility it returns one of the ids in scalar context, and 
*get_taxonid = \&get_taxonids.
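
To illustrate the context-sensitive return and the alias, here is a 
minimal standalone sketch. It is not the actual flatfile code: the 
package My::NameIndex, its add_name() method and the second id are made 
up for the example. Only the get_taxonids()/get_taxonid() names and 
their behaviour come from the notes above.

```perl
use strict;
use warnings;

package My::NameIndex;    # hypothetical stand-in, not the real flatfile module

sub new {
    my ($class) = @_;
    # name => [ids]: every id seen for a name is kept, not just the last one
    return bless { name_to_ids => {} }, $class;
}

# Hypothetical helper to fill the index.
sub add_name {
    my ($self, $name, $id) = @_;
    push @{ $self->{name_to_ids}{$name} }, $id;
    return;
}

sub get_taxonids {
    my ($self, $name) = @_;
    my @ids = @{ $self->{name_to_ids}{$name} || [] };
    # list context: all matching ids; scalar context: one of them
    return wantarray ? @ids : $ids[0];
}

# Keep the old method name working as an alias.
{
    no warnings 'once';
    *get_taxonid = \&get_taxonids;
}

package main;

my $index = My::NameIndex->new;
$index->add_name('Bacillus', 1386);
$index->add_name('Bacillus', 9999);    # second id invented for illustration

my @all = $index->get_taxonids('Bacillus');    # list context
my $one = $index->get_taxonid('Bacillus');     # old name, scalar context

print "all: @all\n";
print "one: $one\n";
```

Old callers that did my $id = $db->get_taxonid($name) keep working, 
while new callers can ask for the full list.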

Added missing division ENV 'Environmental samples'.

# Improvements
Like Bio::DB::Taxonomy::entrez, flatfile now retrieves and stores the 
common names, genetic code and mitochondrial genetic code in each node 
it makes.
NOTE: entrez also stores creation, publication and update dates, but 
this data is not available in the taxdump from the NCBI ftp site.
NOTE: the common names are stored in no particular order; the genbank 
common name in particular isn't necessarily the first in the list (cf. 
old entrez.pm behaviour).

BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the 
division as a three letter code, like 'PRI'. However, for consistency 
with entrez and the scientific_name() of the node the division is 
supposed to correspond to, it is now stored as the full name, like 
'Primates'.

The names->id solution also stores the artificially uniqued names like 
'Craniata <chordata>', allowing you for the first time to retrieve the 
correct id. Previously the search would have simply failed completely.

The names->id solution now handles nodes with scientific names of 'xyz 
(class)', allowing you to retrieve the id with both get_taxonids('xyz') 
and get_taxonids('xyz (class)'). Previously only the latter would work.
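
As a rough sketch of the two lookup improvements above (not the actual 
flatfile implementation), a name can simply be indexed under several 
keys. The functions index_name() and lookup(), the %name_to_ids hash and 
all the ids here are invented for the example.

```perl
use strict;
use warnings;

my %name_to_ids;    # hypothetical in-memory index: name => [ids]

# Register one name: the plain name, an optional artificially uniqued
# name (e.g. 'Craniata <chordata>'), and the taxon id.
sub index_name {
    my ($name, $unique_name, $id) = @_;
    my %keys = ($name => 1);
    $keys{$unique_name} = 1 if defined $unique_name && length $unique_name;

    # 'xyz (class)' should also be findable as plain 'xyz'
    (my $stripped = $name) =~ s/ \(class\)$//;
    $keys{$stripped} = 1;

    push @{ $name_to_ids{$_} }, $id for keys %keys;
    return;
}

sub lookup {
    my ($name) = @_;
    return @{ $name_to_ids{$name} || [] };
}

# ids below are placeholders, not real NCBI taxon ids
index_name('Craniata',    'Craniata <chordata>',   1001);
index_name('Craniata',    'Craniata <brachiopod>', 1002);
index_name('xyz (class)', undef,                   1003);

print join(',', lookup('Craniata')), "\n";
print join(',', lookup('Craniata <chordata>')), "\n";
print join(',', lookup('xyz')), "\n";
print join(',', lookup('xyz (class)')), "\n";
```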

NOTE: the previous 2 changes (and the issues with entrez, see below) 
make flatfile better at searching the taxonomy database than the entrez 
module or the website, both in terms of speed and completeness of results.

BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, 
always being sent directly to Bio::Taxonomy::Node->new(-name => 
$untouched) or the $node->classification() array. Previously, a species 
node would have its name converted from 'Homo sapiens' to 'sapiens', but 
the conversion badly mangled certain other species names.


Bio::DB::Taxonomy::entrez
-------------------------

# Bug-fixes
Special characters like ", ( and ) in the input query string to 
get_taxonid() used to cause the search to fail or give inaccurate 
results. These characters are now removed prior to submission, allowing 
for correct search results.
API-CHANGE: entrez has always been able to return multiple ids that 
match a single input name, so I've renamed get_taxonid() to 
get_taxonids() and it returns an array of ids in list context. It 
returns one of the ids in scalar context. For backward compatibility, 
*get_taxonid = \&get_taxonids.
NOTE: the entrez module (and the website) cannot cope with '<something>' 
in the query, failing searches like 'Craniata <chordata>'. For this 
reason, if get_taxonids() is given a query with '<something>' it will 
immediately return undefined, saving a pointless website access. If you 
want the id of 'Craniata <chordata>' you must search for 'Craniata', 
then get the node for each returned id to see which one has a parent 
node whose scientific_name() or common_names() matches 'chordata' 
case-insensitively.
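
The query handling and the manual disambiguation described above could 
look roughly like the following. This is only a sketch: clean_query() 
and pick_by_parent() are hypothetical helpers, not methods of the entrez 
module, the exact set of characters stripped is an assumption based on 
the three named above, and the candidates are plain hashes with 
invented ids rather than real Node objects.

```perl
use strict;
use warnings;

# Returns the cleaned query, or undef if it contains '<something>'
# (which entrez cannot search for, so there is no point asking).
sub clean_query {
    my ($query) = @_;
    return if $query =~ /<[^>]+>/;
    (my $clean = $query) =~ tr/"()//d;    # assumed character set
    return $clean;
}

# Keep the candidates that have a parent name matching $wanted,
# ignoring case. Each candidate is { id => ..., parent_names => [...] }.
sub pick_by_parent {
    my ($wanted, @candidates) = @_;
    return grep {
        my $candidate = $_;
        grep { lc($_) eq lc($wanted) } @{ $candidate->{parent_names} };
    } @candidates;
}

my $rejected = clean_query('Craniata <chordata>');
my $cleaned  = clean_query('"Bacillus" (genus)');

# placeholder ids and parent names, for illustration only
my @candidates = (
    { id => 1001, parent_names => ['Chordata', 'chordates'] },
    { id => 1002, parent_names => ['Brachiopoda'] },
);
my @matches = pick_by_parent('chordata', @candidates);

print defined $rejected ? "searched\n" : "skipped\n";
print "$cleaned\n";
print "match: $_->{id}\n" for @matches;
```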

# Improvements
BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website.

BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for 
s/ \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => 
$untouched) or the $node->classification() array. Previously, a species 
node would have its name converted from 'Homo sapiens' to 'sapiens', but 
the conversion badly mangled certain other species names.

BEHAVIOUR-CHANGE: all common names of a node are now stored in the 
resulting Node object with Bio::Taxonomy::Node->new(-common_names => 
\@names). This means that the Genbank common name is now just one 
amongst others, and isn't guaranteed to be the first in the list either.


Bio::Taxonomy::Node
-------------------

# Bug-fixes
Uninteresting fixes to get get_Children_Nodes(), get_Lineage_Nodes() 
and get_LCA_Node() to work correctly.

classification() has a proper solution to finding the classification 
when the array wasn't manually set.

# Improvements
BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now 
it is an alias to name('scientific').
NOTE: node_name is what is set when ->new(-name => $name) is called, so 
flatfile and entrez and user-created nodes now implicitly associate the 
name of the node they create with its scientific name.

BEHAVIOUR-CHANGE: scientific_name() used to be an alias to binomial(). 
Now it is *scientific_name = \&node_name.

binomial(), in addition to working the old way (assume first two 
elements of classification array are species and genus, combine them), 
will shortcut and return the scientific_name() if we are a node with 
rank 'species' and scientific_name is two words. This makes binomial() 
an effective synonym of scientific_name() when Nodes were constructed as 
per flatfile or entrez, and when it is used correctly on a species node.
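
The binomial() logic just described can be sketched like this. It is a 
standalone illustration on a plain hash, not the real Node method; the 
hash keys (rank, scientific_name, classification) are assumptions made 
for the example.

```perl
use strict;
use warnings;

# $node is a plain hash standing in for a Bio::Taxonomy::Node.
sub binomial {
    my ($node) = @_;

    # Shortcut: a species node whose scientific name is two words
    my $name = $node->{scientific_name};
    if (($node->{rank} || '') eq 'species' && defined $name) {
        my @words = split ' ', $name;
        return $name if @words == 2;
    }

    # Old way: assume the classification array starts (species, genus, ...)
    my ($species, $genus) = @{ $node->{classification} || [] };
    return unless defined $species && defined $genus;
    return "$genus $species";
}

my $from_name = binomial({
    rank            => 'species',
    scientific_name => 'Homo sapiens',
});
my $from_classification = binomial({
    classification => ['sapiens', 'Homo', 'Hominidae'],
});
my $no_answer = binomial({
    rank            => 'genus',
    scientific_name => 'Bacillus',
});

print "$from_name\n";
print "$from_classification\n";
print defined $no_answer ? "$no_answer\n" : "undef\n";
```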

BEHAVIOUR-CHANGE: *parent_taxon_id = \&parent_id. (Previously, you could 
assign and retrieve different values to/from each method.)

New method common_names() supersedes common_name(), returning a list of 
all common names. For backward compatibility, it returns one of the 
names in scalar context, and *common_name = \&common_names.

-factory and factory() removed, since there is no
Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use
of a factory once set, and a factory seems redundant when we're a node
with a -dbh.

species() and genus() issue a warning when you try to use them on a node 
that isn't of rank 'species' (since they interact with the 
classification array and not names('method') like the other similar 
methods).

validate_name() removed because it just returns 1.

validate_species_name() removed because species() can (should) now 
contain the real species name, like 'Homo sapiens', not 'sapiens'. But 
it could also be any wonderfully complex thing, so there's nothing we 
can confidently check for as being 'correct'.


t/Taxonomy.t
------------

Runs a slightly more comprehensive set of tests on entrez, which are now 
only skipped if data retrieval fails.

Tests flatfile on a cut-down version of the taxdump.


> I'll also fix the problem with node names for ranks species and lower, 
> as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, 
> subspecies/variant names', in the way I suggested there.

This hasn't been done per se, because we now store the real 
ScientificName so there is no 'mishandling' to fix.


