[Bioperl-l] Bio::Taxonomy confusion

Jason Stajich jason.stajich at duke.edu
Thu May 11 12:04:54 UTC 2006


Great - now we just need someone to volunteer to actually work on this.

The current code grabs most of this, but I believe it expects different XML.


On May 10, 2006, at 11:36 PM, Chris Fields wrote:

> I think you can get pretty much everything now, though I can definitely see
> the use of a local database.  I ran a few tests, really unrelated to this,
> using the powerscripting test page at NCBI for eutils (for the curious, at
> http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/eu.cgi) and was able to
> retrieve XML-formatted taxonomic information; here's the bacterium Frankia
> sp. CcI3 TaxID info, which looks like they have everything set up by rank.
> It gives quite a bit of information.
>
> <?xml version="1.0"?>
> <!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN"
> "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
> <TaxaSet>
>
> <Taxon>
>   <TaxId>106370</TaxId>
>   <ScientificName>Frankia sp. CcI3</ScientificName>
>   <ParentTaxId>1854</ParentTaxId>
>   <Rank>species</Rank>
>   <Division>Bacteria</Division>
>   <GeneticCode>
>     <GCId>11</GCId>
>     <GCName>Bacterial and Plant Plastid</GCName>
>   </GeneticCode>
>   <MitoGeneticCode>
>     <MGCId>0</MGCId>
>     <MGCName>Unspecified</MGCName>
>   </MitoGeneticCode>
>   <Lineage>cellular organisms; Bacteria; Actinobacteria; Actinobacteria (class); Actinobacteridae; Actinomycetales; Frankineae; Frankiaceae; Frankia</Lineage>
>   <LineageEx>
>     <Taxon>
>       <TaxId>131567</TaxId>
>       <ScientificName>cellular organisms</ScientificName>
>       <Rank>no rank</Rank>
>     </Taxon>
>     <Taxon>
>       <TaxId>2</TaxId>
>       <ScientificName>Bacteria</ScientificName>
>       <Rank>superkingdom</Rank>
>     </Taxon>
>     <Taxon>
>       <TaxId>201174</TaxId>
>       <ScientificName>Actinobacteria</ScientificName>
>       <Rank>phylum</Rank>
>     </Taxon>
>     <Taxon>
>       <TaxId>1760</TaxId>
>       <ScientificName>Actinobacteria (class)</ScientificName>
>       <Rank>class</Rank>
>     </Taxon>
>     <Taxon>
>       <TaxId>85003</TaxId>
>       <ScientificName>Actinobacteridae</ScientificName>
>       <Rank>subclass</Rank>
>     </Taxon>
>     <Taxon>
>       <TaxId>2037</TaxId>
>       <ScientificName>Actinomycetales</ScientificName>
>       <Rank>order</Rank>
>     </Taxon>
>     <Taxon>
>       <TaxId>85013</TaxId>
>       <ScientificName>Frankineae</ScientificName>
>       <Rank>suborder</Rank>
>     </Taxon>
>     <Taxon>
>       <TaxId>74712</TaxId>
>       <ScientificName>Frankiaceae</ScientificName>
>       <Rank>family</Rank>
>     </Taxon>
>     <Taxon>
>       <TaxId>1854</TaxId>
>       <ScientificName>Frankia</ScientificName>
>       <Rank>genus</Rank>
>     </Taxon>
>   </LineageEx>
>   <CreateDate>1999/10/22</CreateDate>
>   <UpdateDate>2005/01/19</UpdateDate>
>   <PubDate>2000/02/02</PubDate>
> </Taxon>
> </TaxaSet>
>
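
A rough sketch of how the ranked lineage could be pulled out of a record like the one above. The helper name lineage_from_xml is made up for illustration and is not part of BioPerl; it uses plain regexes to stay dependency-free and assumes the flat, non-nested <Taxon> entries inside <LineageEx> shown in the XML. Real code would more likely use a proper XML parser such as XML::Twig.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): extract taxid/name/rank
# for every <Taxon> inside the <LineageEx> block of one taxonomy record.
sub lineage_from_xml {
    my ($xml) = @_;

    # Grab the LineageEx block; return an empty list ref if it is absent.
    my ($lineage_ex) = $xml =~ m{<LineageEx>(.*?)</LineageEx>}s
        or return [];

    my @nodes;
    while ($lineage_ex =~ m{<Taxon>(.*?)</Taxon>}sg) {
        my $entry = $1;
        my ($taxid) = $entry =~ m{<TaxId>\s*(\d+)\s*</TaxId>};
        my ($name)  = $entry =~ m{<ScientificName>\s*(.*?)\s*</ScientificName>}s;
        my ($rank)  = $entry =~ m{<Rank>\s*(.*?)\s*</Rank>}s;
        push @nodes, { taxid => $taxid, name => $name, rank => $rank };
    }
    return \@nodes;    # ordered from the top of the tree down to the genus
}

# Abbreviated copy of the record above (two lineage entries only).
my $xml = <<'XML';
<Taxon>
  <TaxId>106370</TaxId>
  <ScientificName>Frankia sp. CcI3</ScientificName>
  <LineageEx>
    <Taxon>
      <TaxId>2</TaxId>
      <ScientificName>Bacteria</ScientificName>
      <Rank>superkingdom</Rank>
    </Taxon>
    <Taxon>
      <TaxId>1854</TaxId>
      <ScientificName>Frankia</ScientificName>
      <Rank>genus</Rank>
    </Taxon>
  </LineageEx>
</Taxon>
XML

my $lineage = lineage_from_xml($xml);
printf "%-12s %-10s %s\n", @{$_}{qw(rank name taxid)} for @$lineage;
```

Because every entry carries its own rank, nothing here depends on the lineage having a fixed number of levels.
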
> Chris
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Jason Stajich
>> Sent: Wednesday, May 10, 2006 7:54 PM
>> To: Sendu Bala
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
>>
>> I would use the implementation that talks to the flatfile db as the
>> standard here.  Nodes are defined by the data from the taxonomy dump
>> dbs from NCBI.
>> eutils is pretty worthless except for taxid->name or the reverse; you
>> can't get the full taxonomy (or couldn't when that implementation was
>> written).
>>
>> The "name" method refers to the name of the node - each level in the
>> taxonomy can have a "name".
>>
>> The bits of hackiness relate to wrapping the node object as a
>> Bio::Species and/or being able to read a genbank file, take the
>> organism taxonomy data as a list, and instantiate from it.  If we could
>> rely on everything being in a DB, of course, this would be simpler.
>>
>> Another problem is that the depth of the taxonomy is not constant for
>> every node, so assuming that a fixed number of slots will be filled in
>> to generate the taxonomy leads to problems.
>>
>> Use the flatfile implementation (Bio::DB::Taxonomy::flatfile) as the
>> best example of working code, as this is how I really wanted it to
>> work; the Bio::Species hacks are only there to shoehorn in data
>> retrieved from genbank files.  With the flatfile implementation
>> you have to walk all the way up the db hierarchy to get the kingdom
>> for a node, so you do have to build up the classification hierarchy, as
>> each node only stores data about itself.
>>
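
To make the "walk up the hierarchy" point concrete, here is a minimal sketch that uses a plain hash as a stand-in for a local taxonomy db. The hash layout and the function names lineage and ancestor_at_rank are assumptions for illustration, not the Bio::DB::Taxonomy::flatfile API. The taxids, names and ranks are copied from the Frankia record quoted earlier in this thread, and the parent links are inferred from the order of its LineageEx entries.

```perl
use strict;
use warnings;

# Toy stand-in for a local taxonomy db: each node stores only its own
# name, rank and parent taxid. The top node here has no parent.
my %node = (
    106370 => { parent => 1854,   rank => 'species',      name => 'Frankia sp. CcI3' },
    1854   => { parent => 74712,  rank => 'genus',        name => 'Frankia' },
    74712  => { parent => 85013,  rank => 'family',       name => 'Frankiaceae' },
    85013  => { parent => 2037,   rank => 'suborder',     name => 'Frankineae' },
    2037   => { parent => 85003,  rank => 'order',        name => 'Actinomycetales' },
    85003  => { parent => 1760,   rank => 'subclass',     name => 'Actinobacteridae' },
    1760   => { parent => 201174, rank => 'class',        name => 'Actinobacteria (class)' },
    201174 => { parent => 2,      rank => 'phylum',       name => 'Actinobacteria' },
    2      => { parent => 131567, rank => 'superkingdom', name => 'Bacteria' },
    131567 => { parent => undef,  rank => 'no rank',      name => 'cellular organisms' },
);

# Walk parent links from a taxid to the top; returns nodes leaf first.
# %seen guards against a node that lists itself (or a descendant) as parent.
sub lineage {
    my ($taxid) = @_;
    my (@path, %seen);
    while (defined $taxid && exists $node{$taxid} && !$seen{$taxid}++) {
        push @path, { taxid => $taxid, %{ $node{$taxid} } };
        $taxid = $node{$taxid}{parent};
    }
    return @path;
}

# Look up the ancestor at a named rank without assuming a fixed depth.
sub ancestor_at_rank {
    my ($taxid, $rank) = @_;
    for my $n (lineage($taxid)) {
        return $n->{name} if $n->{rank} eq $rank;
    }
    return;    # this lineage has no node at that rank
}

print join('; ', map { $_->{name} } reverse lineage(106370)), "\n";
my $top = ancestor_at_rank(106370, 'superkingdom');
print "$top\n";
```

Looking ranks up by name rather than by position is what sidesteps the fixed-number-of-slots problem: a lineage with no 'kingdom' node simply returns nothing for that rank.
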
>> I'm not exactly sure what you are proposing to do, but I would
>> definitely enjoy another pair of hands; I don't really have time to
>> mess with it any time soon.
>>
>> -jason
>> On May 10, 2006, at 5:30 AM, Sendu Bala wrote:
>>
>>> Hi,
>>> I'm a little confused as to how names are supposed to work in
>>> Bio::Taxonomy::Node.
>>>
>>> In the bioperl versions that I've looked at, a Node doesn't seem to store
>>> the most important information about itself - its scientific name - in
>>> an obvious place. bioperl 1.5.1 puts it at the start of the
>>> classification list. I'd have thought sticking it in -name would make
>>> more sense, but this is used only for the GenBank common name.
>>>
>>> The Bio::Taxonomy docs still suggest:
>>>
>>> my $node_species_sapiens = Bio::Taxonomy::Node->new(
>>>    -object_id => 9606, # or -ncbi_taxid. Required tag
>>>    -names => {
>>>        'scientific' => ['sapiens'],
>>>        'common_name' => ['human']
>>>    },
>>>    -rank => 'species'  # Required tag
>>> );
>>>
>>> and whilst Bio::Taxonomy::Node does not accept -names, it does have a
>>> 'name' method which claims to work like:
>>>
>>> $obj->name('scientific', 'sapiens');
>>>
>>> This kind of thing would be really nice, but afaics
>>> Bio::Taxonomy::Node->new takes the -name value and makes a common name
>>> out of it, whilst the name() method passes any 'scientific' name to the
>>> scientific_name() method, which is unable to set any value (and warns
>>> about this), only get.
>>>
>>> It seems like the need to have this classification array work the same
>>> way as Bio::Species is causing some unnecessary restrictions. Can't the
>>> more sensible idea of having a dedicated storage spot for the
>>> ScientificName and other parameters be used, with the classification
>>> array either being generated just-in-time from the hash-stored data, or
>>> indeed being generated from the Lineage field?
>>>
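
For discussion, a small sketch of what the suggestion above might look like: names kept in a hash keyed by name type, and the classification list built on request from the parent chain instead of being stored on each node. Sketch::TaxonNode is a hypothetical class written for this message; it is not Bio::Taxonomy::Node and does not reflect its actual behaviour. The taxid 9605 for the genus Homo is filled in for the example.

```perl
use strict;
use warnings;

# Hypothetical class, NOT the Bio::Taxonomy::Node API.
package Sketch::TaxonNode;

sub new {
    my ($class, %args) = @_;
    my $self = bless {
        taxid  => $args{-taxid},
        rank   => $args{-rank},
        parent => $args{-parent},    # another Sketch::TaxonNode, or undef
        names  => {},
    }, $class;
    my $names = $args{-names} || {};
    $self->name($_, @{ $names->{$_} }) for keys %$names;
    return $self;
}

# Getter/setter: name('scientific') reads, name('scientific', 'sapiens') sets.
sub name {
    my ($self, $type, @values) = @_;
    $self->{names}{$type} = [@values] if @values;
    return @{ $self->{names}{$type} || [] };
}

sub scientific_name { return ($_[0]->name('scientific'))[0] }
sub rank            { return $_[0]{rank} }

# Generated just-in-time by following parent links; this node's name first.
sub classification {
    my ($self) = @_;
    my @names;
    for (my $n = $self; $n; $n = $n->{parent}) {
        push @names, $n->scientific_name;
    }
    return @names;
}

package main;

my $homo = Sketch::TaxonNode->new(
    -taxid => 9605,
    -rank  => 'genus',
    -names => { scientific => ['Homo'] },
);
my $sapiens = Sketch::TaxonNode->new(
    -taxid  => 9606,
    -rank   => 'species',
    -parent => $homo,
    -names  => { scientific => ['sapiens'], common_name => ['human'] },
);

print join(', ', $sapiens->classification), "\n";
```

With this layout the scientific name has one obvious home and can be both set and read through name(), while the classification list can never drift out of sync with the nodes it is derived from.
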
>>>
>>> Also, why does a node store the complete hierarchy on itself in the
>>> classification array? If we're going that far, why don't the
>>> Bio::DB::Taxonomy modules like Bio::DB::Taxonomy::entrez just have a
>>> get_taxonomy() method instead of a get_Taxonomy_Node() method?
>>> get_taxonomy() could, from a single efetch.fcgi lookup, create a
>>> complete Bio::Taxonomy with all the nodes. Whilst most nodes would only
>>> have a minimum of information, if you could simply ask a node what its
>>> rank and scientific name were you could easily build a classification
>>> array, or ask what Kingdom your species was in etc.
>>>
>>> Are there good reasons for Taxonomy working the way it does in 1.5.1, or
>>> would I not be wasting my time re-writing things to make more sense
>>> (to me)?
>>>
>>>
>>> Cheers,
>>> Sendu.
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> --
>> Jason Stajich
>> Duke University
>> http://www.duke.edu/~jes12
>>
>>
>

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12




