[Bioperl-l] Bio::DB::Taxonomy::entrez updated
Jason Stajich
jason.stajich at duke.edu
Tue Aug 9 13:09:31 EDT 2005
I've updated Bio::DB::Taxonomy::entrez so that it fully parses the XML
returned by the Efetch Eutils CGI script. It can now return a fully
populated Bio::Taxonomy::Node object, most importantly with the parent_id
field filled in. This allows the web-only implementation to work just as
the flatfile implementation does, so you can walk up the taxonomy
hierarchy. There is currently no way to walk down the hierarchy
unless one can construct an Entrez query to get all the nodes which
have a particular parent. If someone knows how to do this, please
let me know.
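A minimal sketch of that upward walk, assuming $db is a Bio::DB::Taxonomy
handle created with -source => 'entrez' and that a node without a usable
parent_id marks the top of the tree. The lineage() helper and its step
limit are hypothetical, not part of the module. I have not checked how the
root node reports its parent, hence the limit as a guard against looping.

```perl
use strict;
use warnings;

# Hypothetical helper: collect a node and its ancestors, nearest first.
# $db only needs to answer get_Taxonomy_Node(-taxonid => ...), and each
# node only needs parent_id, the two calls used elsewhere in this message.
sub lineage {
    my ($db, $node, $max_steps) = @_;
    $max_steps = 100 unless defined $max_steps;   # guard against cycles
    my @path = ($node);
    for (1 .. $max_steps) {
        my $parent_id = $node->parent_id;
        last unless $parent_id;                   # undef, '' or 0: stop
        my $parent = $db->get_Taxonomy_Node(-taxonid => $parent_id);
        last unless defined $parent;              # lookup failed: stop
        push @path, $parent;
        $node = $parent;
    }
    return @path;
}
```

Calling lineage($db, $node) with a node fetched from Entrez should then
give the node followed by its parent, grandparent and so on, with one web
request per level.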
I added a few fields to Bio::Taxonomy::Node to capture genetic_code,
pub_date, update_date, create_date, mitochondrial_genetic_code from
the database entry.
At this point I think we can consider retiring Bio::Species and
replacing it with Bio::Taxonomy::Node. I would probably just make
Bio::Species delegate to Bio::Taxonomy::Node, or maybe someone can think
of something more clever. There will be a bit of fiddling under the
hood to make this really work, but I think it can be done for the 1.6
release and still be transparent to the user (i.e. the API is completely
retained for Bio::Seq->species, Bio::Species, etc.; however, new
functionality is now also available).
Here is how you can use the DB interface:
use Bio::DB::Taxonomy;
my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
my $taxonid = $db->get_taxonid('Homo sapiens');
my $node = $db->get_Taxonomy_Node(-taxonid => $taxonid);
print $node->binomial, "\n";
I added a script in scripts/taxa/query_entrez_taxa.PLS which
demonstrates how to use it as well.
Where I find this module useful is in parsing a search result report
and classifying hits by taxonomy. Given the GI numbers in the search
result (BLAST, FASTA, SSEARCH hits), getting the taxon ID for a GI is
just one step away now.
I added a capability to the API in Bio::DB::Taxonomy::entrez for
retrieving taxonomy info based on a GI number. You can pass the
-gi => $ginumber option to get_Taxonomy_Node.
Demonstration of use here:
my $gi = 71836523;
my $node = $db->get_Taxonomy_Node(-gi => $gi, -db => 'protein');
print $node->binomial, "\n";
my ($species,$genus,$family) = $node->classification;
print "family is $family\n";
# Can also go up 4 levels
my $p = $node;
for ( 1..4 ) {
    $p = $db->get_Taxonomy_Node(-taxonid => $p->parent_id);
}
print $p->rank, " ", ($p->classification)[0], "\n";
# could then classify a set of BLAST hits based on their GI numbers
# into taxonomic categories.
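
The classification step in that last comment could be sketched as
follows. This is only an illustration: tally_by_family() is a hypothetical
helper, and it assumes classification() returns species, genus, family in
that order, as in the example above, and that a failed lookup returns
undef.

```perl
use strict;
use warnings;

# Hypothetical helper: count hits per family for a list of GI numbers.
# $db is assumed to behave like the Bio::DB::Taxonomy entrez handle used
# above; $seqdb is the Entrez database the GIs belong to, e.g. 'protein'.
sub tally_by_family {
    my ($db, $seqdb, @gis) = @_;
    my %count;
    for my $gi (@gis) {
        my $node = $db->get_Taxonomy_Node(-gi => $gi, -db => $seqdb);
        # Assumption: element 2 of classification() is the family.
        my $family = $node ? ($node->classification)[2] : undef;
        $family = 'unknown' unless defined $family;
        $count{$family}++;
    }
    return %count;
}
```

The GI numbers would come from whatever parsed the search report. Each
lookup is a separate web request, so for a long hit list it may be worth
caching nodes by GI.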
I have tried to put these examples in the SYNOPSIS, t/Taxonomy.t and
the script in scripts/taxa/query_entrez_taxa.PLS. If there are
mistakes or typos, or something is unclear, please let us know and it
can be updated. I hope a section describing how to use these in a
SearchIO context (parsing reports) can be added when I have time.
Best,
-jason
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/