[Bioperl-l] Bio::*Taxonomy* changes
Sendu Bala
bix at sendu.me.uk
Mon Jul 17 16:31:37 UTC 2006
I see strange node names via Bio::DB::Taxonomy::flatfile:
use Bio::DB::Taxonomy;
my $db = Bio::DB::Taxonomy->new(
    -source    => 'flatfile',
    -directory => $taxonomy_dir,
    -nodesfile => $taxonomy_dir.'nodes.dmp',
    -namesfile => $taxonomy_dir.'names.dmp',
);
my $tax_id = 89593;
my $node = $db->get_Taxonomy_Node($tax_id);
print "node $tax_id has name '", @{$node->name('common')},
      "' and rank '", $node->rank, "'\n";
Results in:
node 89593 has name 'Craniata <chordata>' and rank 'subphylum'
Other examples:
node 2 has name 'Bacteria <bacteria>' and rank 'superkingdom'
node 1386 has name 'Bacillus <bacterium>' and rank 'genus'
node 7776 has name 'Gnathostomata <vertebrate>' and rank 'superclass'
etc.
For me the bits in <> are inappropriate and shouldn't be there. The NCBI
website agrees, and you won't see these things if you use -source =>
'entrez'. Should they be removed by the flatfile parser as a matter of
course, with no warnings or option? Or do people want them? Typically
they are just the name of the parent node, so I don't see why anyone
would /need/ them, and I argue it's invalid for parent node information
to be duplicated here.
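
For illustration only, stripping a trailing <...> qualifier from a name could look something like the sketch below. This is not the actual flatfile parser change; strip_qualifier is a hypothetical helper name, and it assumes the qualifier always comes last in the name string, as in the examples above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): remove a trailing
# " <qualifier>" such as " <chordata>" from a taxon name.
# Names without a trailing <...> part are returned unchanged.
sub strip_qualifier {
    my ($name) = @_;
    (my $clean = $name) =~ s/\s*<[^<>]*>\s*$//;
    return $clean;
}

# Names taken from the examples in this message, plus one without a qualifier.
for my $raw ('Craniata <chordata>', 'Bacteria <bacteria>', 'Homo sapiens') {
    print "'$raw' -> '", strip_qualifier($raw), "'\n";
}
```

Anchoring the pattern at the end of the string means a '<' appearing elsewhere in a name would be left alone, which seems the safer default.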
If there are no objections I'll strip the <> bits. I also plan to make
$node->name('scientific', 'sapiens') both set and get the node name, and
have flatfile and entrez store all common names via
$obj->name('common', 'human', 'man'). As these changes will make the
implementation match the docs I don't see any problems, except that
flatfile users will now find the node name in a different place
(@{$node->name('scientific')} instead of @{$node->name('common')}).
I'll also fix the problem with node names for ranks species and lower,
as discussed in thread 'Bio::DB::Taxonomy:: mishandles species,
subspecies/variant names', in the way I suggested there.
If anyone can see a problem with any of these changes, let me know asap.