[Bioperl-l] Bio::*Taxonomy* changes
Sendu Bala
bix at sendu.me.uk
Thu Jul 20 13:35:43 UTC 2006
Sendu Bala wrote:
> node 2 has name 'Bacteria <bacteria>' and rank 'superkingdom'
> node 1386 has name 'Bacillus <bacterium>' and rank 'genus'
> node 7776 has name 'Gnathostomata <vertebrate>' and rank 'superclass'
> etc.
>
> For me the bits in <> are inappropriate and shouldn't be there.
> [...]
> If there are no objections I'll strip the <> bits. I also plan to make
> $node->name('scientific', 'sapiens'); set and get the node name, and
> have flatfile and entrez store all common names with
> $obj->name('common', 'human', 'man');.
I'll describe all the changes I've now made and if no-one complains I'll
commit. (I've also made these notes into bug 2047 for easier reference
in the future.)
Bio::DB::Taxonomy::flatfile
---------------------------
# Bug-fixes
Removed invalid requirement that all species nodes have at least 7
named-rank parents.
The names->id solution used by get_taxonid() only stored the last id
associated with a name. However, the name used wasn't necessarily unique,
so multiple ids could match. The names->id solution now remembers all
ids that match a name.
API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids()
and it returns an array of ids in list context. For backward
compatibility it returns one of the ids in scalar context, and
*get_taxonid = \&get_taxonids.
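As a rough illustration of the list/scalar behaviour and the alias described above, here is a minimal standalone sketch. It does not use the real module: %ids_for_name and its ids are made up, and returning the first id in scalar context is an assumption (the change only promises "one of the ids").

```perl
use strict;
use warnings;

# Hypothetical stand-in for the names->id index; the names and ids are
# made up for illustration and are not real NCBI taxon ids.
my %ids_for_name = (
    'ambiguous name' => [ 101, 202 ],
    'unique name'    => [ 303 ],
);

sub get_taxonids {
    my ($name) = @_;
    my @ids = @{ $ids_for_name{$name} || [] };
    # List context: every matching id. Scalar context: just one of them
    # (the first one here, which is an assumption of this sketch).
    return wantarray ? @ids : $ids[0];
}

# Old method name kept as an alias for backward compatibility.
*get_taxonid = \&get_taxonids;

my @all = get_taxonids('ambiguous name');   # list context
my $one = get_taxonid('ambiguous name');    # scalar context, via the alias
print "all: @all\n";
print "one: $one\n";
```

Old callers that did my $id = $db->get_taxonid($name) keep working unchanged, while new callers can ask for the full list.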
Added missing division ENV 'Environmental samples'.
# Improvements
Like Bio::DB::Taxonomy::entrez, flatfile now retrieves and stores the
common names, genetic code and mitochondrial genetic code in each node
it makes.
NOTE: entrez also stores creation, publication and update dates, but
this data is not available in the taxdump from the NCBI FTP site.
NOTE: the common names are stored in no particular order; the genbank
common name in particular isn't necessarily the first in the list (cf.
old entrez.pm behaviour).
BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
division as a three letter code, like 'PRI'. However, for consistency
with entrez and the scientific_name() of the node the division is
supposed to correspond to, it is now stored as the full name, like
'Primates'.
The names->id solution also stores the artificially uniqued names like
'Craniata <chordata>', allowing you for the first time to retrieve the
correct id. Previously the search would have simply failed completely.
The names->id solution now handles nodes with scientific names of 'xyz
(class)', allowing you to retrieve the id with both get_taxonids('xyz')
and get_taxonids('xyz (class)'). Previously only the latter would work.
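The two lookup improvements above could be sketched roughly like this. It is a standalone toy, not the flatfile code: index_name() and %ids_for_name are hypothetical names, the ids are invented, and the sorted return order is only there to make the example deterministic.

```perl
use strict;
use warnings;

my %ids_for_name;   # name => { id => 1, ... }; hypothetical index

# Register an id under every name form a user might search for.
sub index_name {
    my ($id, @names) = @_;
    for my $name (@names) {
        $ids_for_name{$name}{$id} = 1;           # exact form, e.g. 'xyz (class)'
        (my $bare = $name) =~ s/ \(class\)$//;   # same name without ' (class)'
        $ids_for_name{$bare}{$id} = 1;
    }
}

sub get_taxonids {
    my ($name) = @_;
    my @ids = sort { $a <=> $b } keys %{ $ids_for_name{$name} || {} };
    return wantarray ? @ids : $ids[0];
}

# Made-up ids, purely illustrative.
index_name(1, 'xyz (class)');
index_name(2, 'xyz');
index_name(3, 'Craniata', 'Craniata <chordata>');
index_name(4, 'Craniata', 'Craniata <other>');

print join(',', get_taxonids('xyz')), "\n";
print join(',', get_taxonids('xyz (class)')), "\n";
print join(',', get_taxonids('Craniata')), "\n";
print join(',', get_taxonids('Craniata <chordata>')), "\n";
```

Searching for the plain name returns every id that shares it, while the uniqued form narrows the result to the one intended node.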
NOTE: the previous 2 changes (and the issues with entrez, see below)
make flatfile better at searching the taxonomy database than the entrez
module or the website, both in terms of speed and completeness of results.
BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
always being sent directly to Bio::Taxonomy::Node->new(-name =>
$untouched) or the $node->classification() array. Previously, a species
node would have its name converted from 'Homo sapiens' to 'sapiens', but
the conversion badly mangled certain other species names.
Bio::DB::Taxonomy::entrez
-------------------------
# Bug-fixes
Special characters like ", ( and ) in the input query string to
get_taxonid() resulted in the failure or inaccuracy of the search. These
characters are now removed prior to submission, allowing for correct
search results.
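A minimal sketch of that kind of clean-up, assuming only the three characters named above are stripped. clean_query() is a hypothetical helper for illustration and not necessarily how the module does it.

```perl
use strict;
use warnings;

# Hypothetical helper: delete double quotes and parentheses from a query
# before it is submitted to the remote search.
sub clean_query {
    my ($query) = @_;
    $query =~ tr/"()//d;
    return $query;
}

print clean_query('"Bacillus" (subtilis)'), "\n";
```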
API-CHANGE: entrez has always been able to return multiple ids that
match a single input name, so I've renamed get_taxonid() to
get_taxonids() and it returns an array of ids in list context. It
returns one of the ids in scalar context. For backward compatibility,
*get_taxonid = \&get_taxonids.
NOTE: the entrez module (and website) cannot cope with '<something>' in the
query, failing searches like 'Craniata <chordata>'. For this reason, if
get_taxonids() is given a query with '<something>' it will immediately
return undefined, saving a pointless website access. If you want the id
of 'Craniata <chordata>' you must search for 'Craniata', then get the
node for each returned id to see which one has a parent node with a
scientific_name() or common_names() entry that matches 'chordata'
case-insensitively.
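The workaround can be sketched as follows. To keep it runnable without network access it uses a made-up in-memory tree instead of the entrez module: %node, search_ids() and ids_with_parent() are hypothetical, and the ids and parent names are invented.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the remote database. The ids and
# the tree shape are made up for illustration.
my %node = (
    10 => { name => 'Chordata',    common => ['chordates'], parent => undef },
    11 => { name => 'Craniata',    common => [],            parent => 10 },
    20 => { name => 'Otherparent', common => [],            parent => undef },
    21 => { name => 'Craniata',    common => [],            parent => 20 },
);

# Stand-in for the name search: give up early on '<something>' queries,
# otherwise return every id whose name matches exactly.
sub search_ids {
    my ($query) = @_;
    return if $query =~ /<.+>/;
    return sort { $a <=> $b } grep { $node{$_}{name} eq $query } keys %node;
}

# Keep only the ids whose parent's scientific or common names match
# $hint case-insensitively.
sub ids_with_parent {
    my ($name, $hint) = @_;
    my @keep;
    for my $id (search_ids($name)) {
        my $parent_id = $node{$id}{parent};
        next unless defined $parent_id;
        my $parent = $node{$parent_id};
        push @keep, $id
            if grep { lc($_) eq lc($hint) } $parent->{name}, @{ $parent->{common} };
    }
    return @keep;
}

my @direct = search_ids('Craniata <chordata>');      # rejected up front
my @wanted = ids_with_parent('Craniata', 'chordata');
print "direct hits: ", scalar(@direct), "\n";
print "via parent:  @wanted\n";
```

With the real module each step would be a website access, so this costs one search plus one node fetch per candidate id and its parent.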
# Improvements
BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website.
BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/
\(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name =>
$untouched) or the $node->classification() array. Previously, a species
node would have its name converted from 'Homo sapiens' to 'sapiens', but
the conversion badly mangled certain other species names.
BEHAVIOUR-CHANGE: all common names of a node are now stored in the
resulting Node object with Bio::Taxonomy::Node->new(-common_names =>
\@names). This means that the Genbank common name is now just one
amongst others, and isn't guaranteed to be the first in the list either.
Bio::Taxonomy::Node
-------------------
# Bug-fixes
Uninteresting fixes to get get_Children_Nodes(), get_Lineage_Nodes()
and get_LCA_Node() to work correctly.
classification() has a proper solution to finding the classification
when the array wasn't manually set.
# Improvements
BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now
it is an alias to name('scientific').
NOTE: node_name is what is set when ->new(-name => $name) is called, so
flatfile and entrez and user-created nodes now implicitly associate the
name of the node they create with its scientific name.
BEHAVIOUR-CHANGE: scientific_name() used to be an alias to binomial().
Now it is *scientific_name = \&node_name.
binomial(), in addition to working the old way (assume first two
elements of classification array are species and genus, combine them),
will shortcut and return the scientific_name() if we are a node with
rank 'species' and scientific_name is two words. This makes binomial()
an effective synonym of scientific_name() when Nodes were constructed as
per flatfile or entrez, and when it is used correctly on a species node.
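The binomial() logic described above, as a standalone sketch. The node here is a plain hash standing in for a Node object, and the hash keys are hypothetical; only the decision order is taken from the description.

```perl
use strict;
use warnings;

# $node is a plain hash, a hypothetical stand-in for a Node object.
sub binomial {
    my ($node) = @_;
    my $sci = $node->{scientific_name};

    # Shortcut: a species node whose scientific name is exactly two words.
    if (defined $node->{rank} && $node->{rank} eq 'species' && defined $sci) {
        my @words = split ' ', $sci;
        return $sci if @words == 2;
    }

    # Old way: assume the first two classification elements are species
    # and genus, and combine them.
    my ($species, $genus) = @{ $node->{classification} || [] };
    return unless defined $species && defined $genus;
    return "$genus $species";
}

my $new_style = { rank => 'species', scientific_name => 'Homo sapiens' };
my $old_style = { classification => [ 'sapiens', 'Homo', 'Hominidae' ] };

print binomial($new_style), "\n";
print binomial($old_style), "\n";
```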
BEHAVIOUR-CHANGE: *parent_taxon_id = \&parent_id. (Previously, you could
assign and retrieve different values to/from each method.)
New method common_names() supersedes common_name(), returning a list of
all common_names. For backward compatibility, returns one of the names
in scalar context, and *common_name = \&common_names.
-factory and factory() removed, since there is no
Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use
of a factory once set, and a factory seems redundant when we're a node
with a -dbh.
species() and genus() issue a warning when you try to use them on a node
that isn't of rank 'species' (since they interact with the
classification array and not names('method') like the other similar
methods).
validate_name() removed because it just returns 1.
validate_species_name() removed because species() can (should) now
contain the real species name, like 'Homo sapiens', not 'sapiens'. But
it could also be any wonderfully complex thing, so there's nothing we
can confidently check for as being 'correct'.
t/Taxonomy.t
------------
Runs a slightly more comprehensive set of tests on entrez, which are now
only skipped if data retrieval fails.
Tests flatfile on a cut-down version of the taxdump.
> I'll also fix the problem with node names for ranks species and lower,
> as discussed in thread 'Bio::DB::Taxonomy:: mishandles species,
> subspecies/variant names', in the way I suggested there.
This hasn't been done per se, because we now store the real
ScientificName so there is no 'mishandling' to fix.