[Bioperl-l] taxonomy and species
Juguang Xiao
juguang at tll.org.sg
Thu Aug 28 05:55:01 EDT 2003
Hi guys,
I tried to write a simple bioperl-db script that works like the
search on http://www.ncbi.nih.gov/Taxonomy/taxonomyhome.html/ ,
returning the full taxonomy path and all sub-taxonomy nodes. For
example, if I search for 'mouse', it should return the full path as
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus; mouse
All sub-taxonomy nodes should be returned as well, such as 'Asian house
mouse', 'European house mouse', etc.
However, the guru Hilmar told me that bioperl-db currently works with
Bio::Species, not Bio::Taxonomy, so bioperl-db cannot satisfy the above
requirement until its code is adapted to Taxonomy, after Taxonomy
replaces Species. I therefore investigated the species-related modules,
found some puzzles, and would like to volunteer an idea and the code.
Bio::Taxonomy was written by Dan Kortschak. Its main and only
functional method (as opposed to get/set methods, I mean), 'classify',
converts a Species object into an array of names. That wastes such a
nice module name ;-)
Jason wrote Bio::Taxonomy::Node and Bio::DB::Taxonomy, which accesses
NCBI Entrez over HTTP or reads the NCBI taxonomy dump files.
Bio::Taxonomy::Node is tied closely to Bio::DB::Taxonomy, hence it
cannot easily be adapted to the bioperl-db system.
My plan to reform them is described below.
DATA STRUCTURE
A taxonomy should be modelled as a hash whose keys are rank names,
such as 'class' or 'genus', and whose values are identifiers, such as
an NCBI taxid, a scientific name or a Taxonomy::Node object.
$taxonomy = {
    # Rank order, copied from the current Taxonomy module.
    '_rank' => ['root', 'superkingdom', ..., 'species', 'subspecies', ...,
                'no rank'],
    # The keys of this hash are unordered; their order is defined by '_rank'.
    '_hierarchy' => {
        ...
        'class'   => 40674,   # or 'Mammalia', or the Taxonomy::Node
        'genus'   => 'Mus',
        'species' => $tax_Node_musculus,
        ...
    },
    '_factory' => $factory,   # explained later.
};
NOTE: the new taxonomy can represent more than the species level, e.g.
it is flexible enough to represent an object at genus level without a
species.
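
As a rough illustration of how such a hash could be read back in rank
order, here is a minimal sketch. The rank list is abbreviated and the
helper name 'lineage' is made up for this example; neither comes from an
existing BioPerl module.

```perl
use strict;
use warnings;

# Abbreviated rank order, most general first (assumption for this sketch).
my @ranks = qw(superkingdom kingdom phylum class order family genus species);

# A taxonomy at genus level: no 'species' entry is needed.
my $taxonomy = {
    _rank      => \@ranks,
    _hierarchy => {
        class => 'Mammalia',
        genus => 'Mus',
    },
};

# Walk the ranks in their defined order and keep those that are filled in.
sub lineage {
    my ($tax) = @_;
    return map  { $_ . '=' . $tax->{_hierarchy}{$_} }
           grep { defined $tax->{_hierarchy}{$_} }
           @{ $tax->{_rank} };
}

print join('; ', lineage($taxonomy)), "\n";
```

The hash itself stays unordered; only '_rank' decides the order in which
the entries come out.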
$taxNode_mammalia = {
    # NCBI taxid; called 'object_id' for consistency with Bio::IdentifiableI.
    'object_id'   => 40674,
    'rank'        => 'class',
    'name'        => 'Mammalia',   # scientific name
    'common_name' => 'mammals',    # GenBank common name, the term the NCBI site uses
    # A hash with name_class as key and variant name as value.
    'alias' => {
        '' => ''
    },
    '_factory' => $factory
};
$taxNode_mouse = {
'object_id' => 10090,
'rank' => 'species',
'names' => { # This is a general solution!!
'specific' => ['musculus'],
'common' => ['mouse', 'Mickey'],
'includes' => ['nude mice']
}
};
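
To show why the general 'names' hash is convenient, here is a small
sketch of two lookups over it. The helper names 'names_of' and
'has_name' are hypothetical and only serve this example.

```perl
use strict;
use warnings;

my $node = {
    object_id => 10090,
    rank      => 'species',
    names     => {
        specific => ['musculus'],
        common   => ['mouse'],
        includes => ['nude mice'],
    },
};

# Return all names of one name class, or an empty list if there are none.
sub names_of {
    my ($n, $class) = @_;
    return @{ $n->{names}{$class} || [] };
}

# Return 1 if any name in any class matches the query (case-insensitive).
sub has_name {
    my ($n, $query) = @_;
    for my $class (keys %{ $n->{names} }) {
        return 1 if grep { lc($_) eq lc($query) } @{ $n->{names}{$class} };
    }
    return 0;
}

print "found\n" if has_name($node, 'Mouse');
```

A search for 'mouse' then needs no special case per name class: every
class is just another key in the same hash.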
OBJECTS
Bio::Taxonomy will override all methods in Bio::Species, for the sake
of backwards compatibility. If the taxonomy object represents a level
higher than species, the sub 'binomial' returns undef; otherwise it
simply builds the result by combining the genus and species. The sub
'classification' will look like:
foreach my $rank (@ranks) {
    unshift @classification, $taxonomy->{_hierarchy}{$rank}
        if defined $taxonomy->{_hierarchy}{$rank};
}
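
A self-contained sketch of these two subs might look as follows. It
assumes a plain rank-to-name hash and an abbreviated rank list, and it is
not the actual Bio::Species implementation.

```perl
use strict;
use warnings;

# Abbreviated rank order, most general first (assumption for this sketch).
my @ranks = qw(class order family genus species);

# Most specific rank first, skipping ranks that are not set.
sub classification {
    my ($h) = @_;
    my @classification;
    for my $rank (@ranks) {
        unshift @classification, $h->{$rank} if defined $h->{$rank};
    }
    return @classification;
}

# Genus plus species, or nothing (undef in scalar context) above species level.
sub binomial {
    my ($h) = @_;
    return unless defined($h->{genus}) && defined($h->{species});
    return "$h->{genus} $h->{species}";
}

my $mouse = { class => 'Mammalia', genus => 'Mus', species => 'musculus' };
my $genus = { class => 'Mammalia', genus => 'Mus' };

print join(', ', classification($mouse)), "\n";
print binomial($mouse), "\n";
print defined(scalar binomial($genus)) ? "defined\n" : "undef\n";
```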
Bio::Taxonomy::Node has NO reference to either the parent node or the
taxonomy object, so that Node objects can be freely shared among
Taxonomy objects. Tricky: once a Node object is created, its content
should not be changed. If a Taxonomy needs one of its Nodes modified, it
has to make a new Node, in case that Node is shared with other Taxonomy
objects.
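
The copy-instead-of-modify rule could be sketched like this. The helper
'node_with' is a made-up name, and locking the shared hash with the core
module Hash::Util is just one possible way to enforce the rule, not part
of the proposal.

```perl
use strict;
use warnings;
use Hash::Util qw(lock_hash);

# Build a modified copy of a node; the original is left untouched.
# Note: this is a shallow copy, so nested structures would still be shared.
sub node_with {
    my ($node, %changes) = @_;
    return { %$node, %changes };
}

my $shared = { object_id => 10090, rank => 'species', name => 'Mus musculus' };
lock_hash(%$shared);    # any later write to this hash will die

# Two taxonomies share the same node.
my %tax_a = (species => $shared);
my %tax_b = (species => $shared);

# Taxonomy B wants a changed node, so it gets its own copy.
$tax_b{species} = node_with($shared, common_name => 'house mouse');

print $tax_b{species}{common_name}, "\n";
```

Taxonomy A still points at the original node, which is why sharing stays
safe.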
Definitely, we need a Taxonomy factory, like Jason's Bio::DB::Taxonomy
or what we are going to create in bioperl-db. Both Taxonomy objects and
Node objects hold a reference to this factory, so that a Taxonomy can
be created automatically and a Node can ask who its parent is (e.g.
$node->get_parent_node, implemented as
$node->_factory->find_parent_node($node)).
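
Here is a toy, in-memory sketch of that delegation. The packages
ToyFactory and ToyNode and the method add_node are hypothetical names for
illustration; a real factory would sit on top of Bio::DB::Taxonomy or
bioperl-db. The parent links skip the intermediate ranks to keep the
example short.

```perl
use strict;
use warnings;

package ToyFactory;    # hypothetical in-memory factory, not a BioPerl class

sub new {
    my ($class) = @_;
    return bless { nodes => {}, parent_of => {} }, $class;
}

# Register a node and remember its parent id (undef for the top node).
sub add_node {
    my ($self, $node, $parent_id) = @_;
    $node->{_factory} = $self;
    $self->{nodes}{ $node->{object_id} }     = $node;
    $self->{parent_of}{ $node->{object_id} } = $parent_id;
    return $node;
}

sub find_parent_node {
    my ($self, $node) = @_;
    my $pid = $self->{parent_of}{ $node->{object_id} };
    return defined($pid) ? $self->{nodes}{$pid} : undef;
}

package ToyNode;       # hypothetical node class

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# The node stores no parent; it asks the factory instead.
sub get_parent_node {
    my ($self) = @_;
    return $self->{_factory}->find_parent_node($self);
}

package main;

my $factory = ToyFactory->new;
$factory->add_node(ToyNode->new(object_id => 40674, name => 'Mammalia'), undef);
$factory->add_node(ToyNode->new(object_id => 10088, name => 'Mus'), 40674);
my $mouse = $factory->add_node(
    ToyNode->new(object_id => 10090, name => 'Mus musculus'), 10088);

# Climb from the mouse node to the top, building the path root-first.
my @path;
for (my $n = $mouse; $n; $n = $n->get_parent_node) {
    unshift @path, $n->{name};
}
print join('; ', @path), "\n";
```

Because only the factory knows the parent links, the same Node could be
handed to a different factory without changing the Node class.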
Comments, please, and I will then turn the idea into code.
Thanks.
Juguang
------------ATGCCGAGCTTNNNNCT--------------
Juguang Xiao
Bioinformatics Engineer
Temasek Life Sciences Laboratory, National University of Singapore
1 Research Link, Singapore 117604
fax: (+65) 68727007
juguang at tll.org.sg