[Bioperl-l] taxonomy and species
Brian Osborne
brian_osborne at cognia.com
Thu Aug 28 08:49:12 EDT 2003
Juguang,
It sounds like you've formulated a solution, but let me describe another
approach, if only to plug BioSQL for those who haven't installed it. With a
BioSQL database installed one can run Aaron's load_taxonomy.pl (found in
the biosql package), which loads the current taxonomy data from NCBI. There
you'll find each taxon labeled by name ("Arabidopsis"), node_rank ("genus"),
and parent_taxon_id. Yes, this approach is a bit more "mechanical" than
yours, but a straightforward script will get either the "full path" or the
children from the database. Sidelight: see
http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html for Aaron's
nice article on the meaning of the right_value and left_value fields.
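To make the "mechanical" approach concrete, here is a minimal sketch that
works on an in-memory copy of a few taxon rows instead of a live database.
The field names (name, node_rank, parent_taxon_id, left_value, right_value)
are the ones mentioned above; the ids, the left/right numbers, the shortened
lineage and the helper names full_path and descendants are made up for
illustration. It assumes left_value/right_value form a nested set, so a
descendant's interval lies strictly inside its ancestor's.

```perl
use strict;
use warnings;

# Toy rows keyed by taxon_id. Ids and left/right values are invented;
# a real database would supply them.
my %taxon = (
    1 => { name => 'Eukaryota', node_rank => 'superkingdom',
           parent_taxon_id => undef, left_value => 1, right_value => 10 },
    2 => { name => 'Mammalia', node_rank => 'class',
           parent_taxon_id => 1, left_value => 2, right_value => 9 },
    3 => { name => 'Mus', node_rank => 'genus',
           parent_taxon_id => 2, left_value => 3, right_value => 8 },
    4 => { name => 'mouse', node_rank => 'species',
           parent_taxon_id => 3, left_value => 4, right_value => 7 },
    5 => { name => 'asian house mouse', node_rank => 'subspecies',
           parent_taxon_id => 4, left_value => 5, right_value => 6 },
);

# Full path: follow parent_taxon_id upwards until there is no parent.
sub full_path {
    my ($id) = @_;
    my @path;
    while (defined $id) {
        unshift @path, $taxon{$id}{name};
        $id = $taxon{$id}{parent_taxon_id};
    }
    return @path;
}

# Descendants: every row whose interval lies strictly inside the given one.
sub descendants {
    my ($id) = @_;
    my ($l, $r) = @{ $taxon{$id} }{qw(left_value right_value)};
    return map  { $taxon{$_}{name} }
           sort { $taxon{$a}{left_value} <=> $taxon{$b}{left_value} }
           grep { $taxon{$_}{left_value} > $l && $taxon{$_}{right_value} < $r }
           keys %taxon;
}

print join('; ', full_path(4)), "\n";
print join(', ', descendants(3)), "\n";
```

Against a real BioSQL instance the same two lookups would presumably become
queries on the taxon tables; the hash above only stands in for those rows.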
If you do write the code you've suggested, please send the final script;
it sounds like a good one for our examples/ directory.
Brian O.
-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Juguang Xiao
Sent: Thursday, August 28, 2003 5:55 AM
To: bioperl-l at bioperl.org
Subject: [Bioperl-l] taxonomy and species
Hi guys,
I tried to write a simple bioperl-db script that works like the
search on http://www.ncbi.nih.gov/Taxonomy/taxonomyhome.html/ , to
return a full taxonomy path and all sub taxonomy nodes. Say, if I
search for 'mouse', it will return the full path as
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus; mouse
And all sub taxonomy nodes will be also returned, like 'asian house
mouse', 'european house mouse', etc.
However, the Guru Hilmar told me that the current bioperl-db works on
Bio::Species, not Bio::Taxonomy, so bioperl-db cannot satisfy the
requirement above until Taxonomy replaces Species and the code is adapted
to Taxonomy. Hence I investigated the species-related modules, found
some puzzles, and would like to volunteer an idea and the code.
Bio::Taxonomy was written by Dan Kortschak, and its main and only
functional method (other than get/set, I mean), 'classify', converts
a Species object into an array of names. It wastes such a nice
module name ;-)
Jason wrote Bio::Taxonomy::Node and Bio::DB::Taxonomy, which either
accesses NCBI Entrez over HTTP or reads the NCBI Tax dump files.
Bio::Taxonomy::Node is tied closely to Bio::DB::Taxonomy, hence it
cannot easily be adapted to the bioperl-db system.
My plan to reform them is described below.
DATA STRUCTURE
Taxonomy should be abstracted as a hash with the keys as rank names,
such as 'class', 'genus', and values as the identifiers, such as NCBI
taxid, scientific name or Taxonomy::Node object.
$taxonomy = {
    # Rank names in order, copied from the current Taxonomy module.
    '_rank' => ['root', 'superkingdom', ..., 'species', 'subspecies', ...,
                'no rank'],
    # The keys of this hash are unordered; their order is defined by '_rank'.
    '_hierarchy' => {
        ...
        'class'   => 40674,   # or 'Mammalia', or the Taxonomy::Node
        'genus'   => 'Mus',
        'species' => $tax_Node_musculus,
        ...
    },
    '_factory' => $factory,   # explained later
};
NOTE: the new taxonomy can represent more than the species level, e.g. it
is flexible enough to represent an object at genus level without a species.
$taxNode_mammalia = {
    # NCBI taxid; called 'object_id' for consistency with Bio::IdentifiableI
    'object_id'   => 40674,
    'rank'        => 'class',
    'name'        => 'Mammalia',   # scientific name
    'common_name' => 'mammals',    # GenBank common name, as the NCBI site
                                   # uses the term
    # a hash with name_class as key and variant name as value
    'alias'       => {
        '' => ''
    },
    '_factory'    => $factory
};
$taxNode_mouse = {
    'object_id' => 10090,
    'rank'      => 'species',
    'names'     => {   # This is a general solution!!
        'specific' => ['musculus'],
        'common'   => ['mouse', 'Mickey'],
        'includes' => ['nude mice']
    }
};
OBJECTS
Bio::Taxonomy will override all methods in Bio::Species, for the sake
of backwards compatibility. If the tax object represents a level higher
than species, the sub 'binomial' returns undef; otherwise it simply builds
the result by combining the genus and species. The sub 'classification'
will look like:
foreach (@ranks) {
    unshift @classification, $taxonomy->{'_hierarchy'}{$_}
        if defined $taxonomy->{'_hierarchy'}{$_};
}
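As a runnable illustration of the two subs just described, here is a small
sketch. It assumes a shortened rank list and plain name strings as hierarchy
values; the variable names and the standalone subs (rather than methods on a
Bio::Taxonomy object) are only for illustration, not the proposed API.

```perl
use strict;
use warnings;

# Shortened rank list, highest rank first (the real list is longer).
my @ranks = ('superkingdom', 'class', 'genus', 'species');

# Two hypothetical hierarchies: one down to species, one stopping at genus.
my %mouse = (class => 'Mammalia', genus => 'Mus', species => 'musculus');
my %genus = (class => 'Mammalia', genus => 'Mus');

# Lowest rank first, skipping ranks the hierarchy does not define.
sub classification {
    my ($h) = @_;
    my @classification;
    for my $rank (@ranks) {
        unshift @classification, $h->{$rank} if defined $h->{$rank};
    }
    return @classification;
}

# undef above species level, otherwise "Genus species".
sub binomial {
    my ($h) = @_;
    return unless defined $h->{genus} && defined $h->{species};
    return "$h->{genus} $h->{species}";
}

print join(', ', classification(\%mouse)), "\n";
print binomial(\%mouse), "\n";
print defined binomial(\%genus) ? "defined\n" : "undef\n";
```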
Bio::Taxonomy::Node has NO reference to either the parent node or the
taxonomy object, so that Node objects can be freely shared among
Taxonomy objects. Tricky: once a Node object is created, its content
should not be changed. If a Taxonomy needs one of its Nodes modified, it
has to make a new Node, in case that Node is shared by another Taxonomy.
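The sharing rule can be sketched as copy-on-modify. This is only an
illustration with plain hashes; the helper name node_with is hypothetical,
and the copy is shallow, so nested structures such as a 'names' hash would
still be shared and would need their own copy before being changed.

```perl
use strict;
use warnings;

# Hypothetical helper: return a modified copy and leave the original alone.
sub node_with {
    my ($node, %changes) = @_;
    return { %$node, %changes };   # shallow copy plus overrides
}

my $mouse = { object_id => 10090, rank => 'species' };

# Two taxonomies share the same node.
my %tax_a = (species => $mouse);
my %tax_b = (species => $mouse);

# tax_a wants a changed node, so it gets a new one; tax_b is unaffected.
$tax_a{species} = node_with($mouse, common_name => 'house mouse');

print exists $tax_b{species}{common_name} ? "changed\n" : "unchanged\n";
print $tax_a{species}{common_name}, "\n";
```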
Definitely, we need a Taxonomy factory, like Jason's Bio::DB::Taxonomy
or what we are going to create in bioperl-db. Both Taxonomy objects and
Node objects have a reference to this factory, so that a Taxonomy can be
created automatically and a Node can ask who its parent is (e.g.
$node->get_parent_node, implemented as
$node->_factory->find_parent_node($node)).
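A minimal sketch of that delegation follows. The package names
My::TaxFactory and My::TaxNode and the add_node method are hypothetical
stand-ins, the parent lookup is a simple in-memory table rather than a
database, and the lineage is shortened so that the class is the direct
parent of the species.

```perl
use strict;
use warnings;

package My::TaxFactory;   # hypothetical stand-in for the real factory

sub new {
    my ($class) = @_;
    return bless { nodes => {}, parent_of => {} }, $class;
}

# Create a node, register it, and remember its parent id in the factory.
sub add_node {
    my ($self, $parent_id, %args) = @_;
    my $node = My::TaxNode->new(%args, _factory => $self);
    $self->{nodes}{ $node->object_id }     = $node;
    $self->{parent_of}{ $node->object_id } = $parent_id;
    return $node;
}

# The parent relation lives here, not in the node.
sub find_parent_node {
    my ($self, $node) = @_;
    my $pid = $self->{parent_of}{ $node->object_id };
    return defined $pid ? $self->{nodes}{$pid} : undef;
}

package My::TaxNode;      # hypothetical node class

sub new       { my ($class, %args) = @_; return bless {%args}, $class }
sub object_id { $_[0]{object_id} }
sub name      { $_[0]{name} }
sub _factory  { $_[0]{_factory} }

# The node only knows its factory and asks it for the parent.
sub get_parent_node {
    my ($self) = @_;
    return $self->_factory->find_parent_node($self);
}

package main;

my $factory  = My::TaxFactory->new;
my $mammalia = $factory->add_node(undef, object_id => 40674,
                                  rank => 'class', name => 'Mammalia');
my $mouse    = $factory->add_node(40674, object_id => 10090,
                                  rank => 'species', name => 'mouse');

print $mouse->get_parent_node->name, "\n";
```

Because the factory and its nodes point at each other, a longer-lived
program would probably want to weaken one side of that reference.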
Comments, please, and I will turn the idea into code.
Thanks.
Juguang
------------ATGCCGAGCTTNNNNCT--------------
Juguang Xiao
Bioinformatics Engineer
Temasek Life Sciences Laboratory, National University of Singapore
1 Research Link, Singapore 117604
fax: (+65) 68727007
juguang at tll.org.sg
_______________________________________________
Bioperl-l mailing list
Bioperl-l at portal.open-bio.org
http://portal.open-bio.org/mailman/listinfo/bioperl-l