[Bioperl-l] Bio::Taxonomy confusion

Thu May 11 15:42:07 UTC 2006

I think you'll see it is different. This is mostly a limitation of the
GenBank format: the Bio::Species objects that you get from a GenBank
parse do not represent the full capabilities of a Taxonomy::Node.

I am happy for someone to overhaul things, but it all boils down to
inferring which part of a list of names is the species versus
sub-species versus strain when none of the members of the list are
labeled. We have some of the same problems with SwissProt as well. I
just don't think we can get it right from the GenBank file data alone,
so I don't see much point in expecting Bio::Species to provide more
than a representation of what is in the file and just return that
array.
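
To make "what is in the file" concrete, here is a rough sketch in plain
Perl. It makes no BioPerl calls, and parse_organism is a made-up helper,
not a method of any module. It only shows that an ORGANISM block (the
Staphylococcus record further down) gives us a name string plus an
unlabeled list, nothing more.

```perl
use strict;
use warnings;

# Hypothetical helper, not a BioPerl method: split a GenBank ORGANISM
# block into the organism line and the unlabeled lineage list.
sub parse_organism {
    my ($block) = @_;
    my @lines = split /\n/, $block;

    # The first line carries the ORGANISM keyword and the full name.
    (my $name = shift @lines) =~ s/^\s*ORGANISM\s+//;
    $name =~ s/\s+$//;

    # The remaining lines are the lineage: ";"-separated, ending in ".".
    my $lineage = join ' ', @lines;
    $lineage =~ s/^\s+//;
    $lineage =~ s/\.\s*$//;
    my @classification = split /\s*;\s*/, $lineage;

    return ($name, \@classification);
}

my $block = <<'END';
  ORGANISM  Staphylococcus haemolyticus JCSC1435
            Bacteria; Firmicutes; Bacillales; Staphylococcus.
END

my ($name, $classification) = parse_organism($block);
print "name:    $name\n";
print "lineage: @$classification\n";
```

Nothing in either return value says which word is a species, a
sub-species or a strain; any rank has to be inferred.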

It has seemed like we need to either special-case things pretty heavily
or do a lookup in the taxonomy db.
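
As a sketch of what the lookup route buys you, assuming a hash can stand
in for the taxonomy db: the %nodes table and labeled_lineage below are
hypothetical, not BioPerl code. The names, ranks and taxids are copied
from the Frankia sp. CcI3 XML quoted below; the parent links above the
genus are my assumption, read off the order of that lineage.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for a taxonomy db: taxid => node.
# Parent links above the genus are assumed from the lineage order.
my %nodes = (
    106370 => { name => 'Frankia sp. CcI3',       rank => 'species',      parent => 1854   },
    1854   => { name => 'Frankia',                rank => 'genus',        parent => 74712  },
    74712  => { name => 'Frankiaceae',            rank => 'family',       parent => 85013  },
    85013  => { name => 'Frankineae',             rank => 'suborder',     parent => 2037   },
    2037   => { name => 'Actinomycetales',        rank => 'order',        parent => 85003  },
    85003  => { name => 'Actinobacteridae',       rank => 'subclass',     parent => 1760   },
    1760   => { name => 'Actinobacteria (class)', rank => 'class',        parent => 201174 },
    201174 => { name => 'Actinobacteria',         rank => 'phylum',       parent => 2      },
    2      => { name => 'Bacteria',               rank => 'superkingdom', parent => 131567 },
    131567 => { name => 'cellular organisms',     rank => 'no rank',      parent => undef  },
);

# Walk from a node up to the root, collecting [rank, name] pairs.
sub labeled_lineage {
    my ($taxid) = @_;
    my @lineage;
    while (defined $taxid) {
        my $node = $nodes{$taxid} or die "unknown taxid $taxid\n";
        push @lineage, [ $node->{rank}, $node->{name} ];
        $taxid = $node->{parent};
    }
    return @lineage;
}

for my $level (labeled_lineage(106370)) {
    printf "%-12s %s\n", @$level;
}
```

Each node only knows its own name, rank and parent, so every level
comes back labeled and the depth of the lineage does not matter.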

Can you guess which value is the strain and which is the sub-species?
What happens when there is a two-part (space-separated) strain name and
a sub-species or variety designation?

SOURCE      Staphylococcus haemolyticus JCSC1435
   ORGANISM  Staphylococcus haemolyticus JCSC1435
             Bacteria; Firmicutes; Bacillales; Staphylococcus.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=279808
strain is JCSC1435

versus
SOURCE      Muntiacus muntjak vaginalis
   ORGANISM  Muntiacus muntjak vaginalis
             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
             Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
             Pecora; Cervidae; Muntiacinae; Muntiacus.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9887
species is muntjak, sub-species vaginalis ?

versus
SOURCE      Aspergillus nidulans FGSC A4
   ORGANISM  Aspergillus nidulans FGSC A4
             Eukaryota; Fungi; Ascomycota; Pezizomycotina; Eurotiomycetes;
             Eurotiales; Trichocomaceae; Emericella.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=227321

Should the genus be Aspergillus or Emericella?

Strain and subspecies/variety in the same entry
SOURCE      Cryptococcus neoformans var. grubii H99
   ORGANISM  Cryptococcus neoformans var. grubii H99
             Eukaryota; Fungi; Basidiomycota; Hymenomycetes;
             Heterobasidiomycetes; Tremellomycetidae; Tremellales; Tremellaceae;
             Filobasidiella.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=235443
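
A naive positional split over the four names above shows where the
guessing breaks down. This is plain Perl, not BioPerl code; guess_parts
is a made-up helper, and treating "var." and "subsp." as the only
trustworthy labels is my assumption for the sketch.

```perl
use strict;
use warnings;

# Hypothetical heuristic: take the first two words as genus and
# species, then accept a name that follows an explicit "var." or
# "subsp." marker as labeled. Whatever is left over is returned as
# 'unknown' because the file does not say what it is.
sub guess_parts {
    my ($organism) = @_;
    my ($genus, $species, @rest) = split ' ', $organism;
    my %parts = (genus => $genus, species => $species);

    if (@rest >= 2 && $rest[0] =~ /^(var|subsp)\.$/) {
        $parts{ $1 eq 'var' ? 'variety' : 'subspecies' } = $rest[1];
        splice @rest, 0, 2;    # drop the marker and the labeled name
    }
    $parts{unknown} = join ' ', @rest if @rest;

    return \%parts;
}

for my $organism (
    'Staphylococcus haemolyticus JCSC1435',
    'Muntiacus muntjak vaginalis',
    'Aspergillus nidulans FGSC A4',
    'Cryptococcus neoformans var. grubii H99',
) {
    my $p = guess_parts($organism);
    print join(', ', map { "$_=$p->{$_}" } sort keys %$p), "\n";
}
```

Only the "var." case gets a label. JCSC1435, vaginalis and "FGSC A4"
all land in 'unknown', and the split says nothing about the Aspergillus
versus Emericella question either.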

On May 11, 2006, at 10:57 AM, Chris Fields wrote:

> Heh...
>
> To tell the truth, I haven't looked at Bio::DB::Taxonomy in any depth
> yet, but I myself have seen issues with the way Bio::Species treats
> bacterial strains (I guess this also involves Bio::Taxonomy::Node
> since that's what Bio::Species delegates to).  It seems to repeat some
> strain names when using $seq->species->common_name.  Not a killer
> problem, but annoying since the correct name is in the source tag in
> the feature table!  I 'could' take a look at it but I can't guarantee
> quick results.
>
> Jason, I could add Taxonomy to the EUtilities overhaul I mentioned to
> you previously, but it'll take a while to get going.  I'm really more
> interested in getting epost-esearch-efetch sequence retrieval up and
> running first with the same API as Bio::DB::GenBank/GenPept and
> Bio::DB::Query::GenBank, then donating the code (late summer/fall???)
> after working out namespace issues so it doesn't conflict with the
> current Bio::DB::WebDBSeqI inheritance.  I suppose I could also look
> at Bio::DB::Taxonomy to see what's up in the next couple of weeks
> (after conference), unless someone gets to it sooner.
>
> Chris
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Jason Stajich
>> Sent: Thursday, May 11, 2006 7:05 AM
>> To: Chris Fields
>> Cc: bioperl-l at lists.open-bio.org; 'Sendu Bala'
>> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
>>
>> Great - now we just need someone to volunteer to actually work on
>> this.
>>
>> The current code grabs most of this, but I believe it expects a
>> different XML.
>>
>>
>> On May 10, 2006, at 11:36 PM, Chris Fields wrote:
>>
>>> I think you can get pretty much everything now, though I can
>>> definitely see the use of a local database.  I ran a few tests,
>>> really unrelated to this, using the powerscripting test page at
>>> NCBI for eutils (for the curious, at
>>> http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/eu.cgi) and was
>>> able to retrieve XML-formatted taxonomic information; here's the
>>> TaxID info for the bacterium Frankia sp. CcI3, which looks like
>>> they have everything set up by rank.  It gives quite a bit of
>>> information.
>>>
>>> <?xml version="1.0"?>
>>> <!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN"
>>> "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
>>> <TaxaSet>
>>>
>>> <Taxon>
>>>   <TaxId>106370</TaxId>
>>>   <ScientificName>Frankia sp. CcI3</ScientificName>
>>>   <ParentTaxId>1854</ParentTaxId>
>>>   <Rank>species</Rank>
>>>   <Division>Bacteria</Division>
>>>   <GeneticCode>
>>>     <GCId>11</GCId>
>>>     <GCName>Bacterial and Plant Plastid</GCName>
>>>   </GeneticCode>
>>>   <MitoGeneticCode>
>>>     <MGCId>0</MGCId>
>>>     <MGCName>Unspecified</MGCName>
>>>   </MitoGeneticCode>
>>>   <Lineage>cellular organisms; Bacteria; Actinobacteria;
>>>     Actinobacteria (class); Actinobacteridae; Actinomycetales;
>>>     Frankineae; Frankiaceae; Frankia</Lineage>
>>>   <LineageEx>
>>>     <Taxon>
>>>       <TaxId>131567</TaxId>
>>>       <ScientificName>cellular organisms</ScientificName>
>>>       <Rank>no rank</Rank>
>>>     </Taxon>
>>>     <Taxon>
>>>       <TaxId>2</TaxId>
>>>       <ScientificName>Bacteria</ScientificName>
>>>       <Rank>superkingdom</Rank>
>>>     </Taxon>
>>>     <Taxon>
>>>       <TaxId>201174</TaxId>
>>>       <ScientificName>Actinobacteria</ScientificName>
>>>       <Rank>phylum</Rank>
>>>     </Taxon>
>>>     <Taxon>
>>>       <TaxId>1760</TaxId>
>>>       <ScientificName>Actinobacteria (class)</ScientificName>
>>>       <Rank>class</Rank>
>>>     </Taxon>
>>>     <Taxon>
>>>       <TaxId>85003</TaxId>
>>>       <ScientificName>Actinobacteridae</ScientificName>
>>>       <Rank>subclass</Rank>
>>>     </Taxon>
>>>     <Taxon>
>>>       <TaxId>2037</TaxId>
>>>       <ScientificName>Actinomycetales</ScientificName>
>>>       <Rank>order</Rank>
>>>     </Taxon>
>>>     <Taxon>
>>>       <TaxId>85013</TaxId>
>>>       <ScientificName>Frankineae</ScientificName>
>>>       <Rank>suborder</Rank>
>>>     </Taxon>
>>>     <Taxon>
>>>       <TaxId>74712</TaxId>
>>>       <ScientificName>Frankiaceae</ScientificName>
>>>       <Rank>family</Rank>
>>>     </Taxon>
>>>     <Taxon>
>>>       <TaxId>1854</TaxId>
>>>       <ScientificName>Frankia</ScientificName>
>>>       <Rank>genus</Rank>
>>>     </Taxon>
>>>   </LineageEx>
>>>   <CreateDate>1999/10/22</CreateDate>
>>>   <UpdateDate>2005/01/19</UpdateDate>
>>>   <PubDate>2000/02/02</PubDate>
>>> </Taxon>
>>>
>>> </TaxaSet>
>>>
>>> Chris
>>>
>>>> -----Original Message-----
>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>> bounces at lists.open-bio.org] On Behalf Of Jason Stajich
>>>> Sent: Wednesday, May 10, 2006 7:54 PM
>>>> To: Sendu Bala
>>>> Cc: bioperl-l at lists.open-bio.org
>>>> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
>>>>
>>>> I would use the implementation that talks to the flatfile db as the
>>>> standard here.  Nodes are defined by the data in the taxonomy dump
>>>> dbs from NCBI.  eutils is pretty worthless except for taxid->name
>>>> or the reverse; you can't get the full taxonomy (or couldn't when
>>>> that implementation was written).
>>>>
>>>> The "name" method refers to the name of the node - each level in
>>>> the taxonomy can have a "name".
>>>>
>>>> The bits of hackiness relate to wrapping the node object as a
>>>> Bio::Species and/or being able to read a genbank file and the
>>>> organism taxonomy data as a list and instantiating.  If we could
>>>> rely on everything being in a DB of course this would be simpler.
>>>>
>>>> Another problem is that the depth of the taxonomy is not constant
>>>> for every node, so assuming that a fixed number of slots will be
>>>> filled in to generate the taxonomy leads to problems.
>>>>
>>>> Use the flatfile implementation (Bio::DB::Taxonomy::flatfile) as
>>>> the best example of working code, as this is how I really wanted it
>>>> to work; the Bio::Species hacks are only there to shoehorn in data
>>>> retrieved from genbank files.  With the flatfile implementation you
>>>> have to walk all the way up the db hierarchy to get the kingdom for
>>>> a node, so you do have to build up the classification hierarchy, as
>>>> each node only stores data about itself.
>>>>
>>>> I'm not exactly sure what you are proposing to do, but would
>>>> definitely enjoy another pair of hands, I don't really have time to
>>>> mess with it any time soon.
>>>>
>>>> -jason
>>>> On May 10, 2006, at 5:30 AM, Sendu Bala wrote:
>>>>
>>>>> Hi,
>>>>> I'm a little confused as to how names are supposed to work in
>>>>> Bio::Taxonomy::Node.
>>>>>
>>>>> In the bioperl versions that I've looked at, a Node doesn't seem to
>>>>> store the most important information about itself - its scientific
>>>>> name - in an obvious place. bioperl 1.5.1 puts it at the start of
>>>>> the classification list. I'd have thought sticking it in -name
>>>>> would make more sense, but this is used only for the GenBank common
>>>>> name.
>>>>>
>>>>> The Bio::Taxonomy docs still suggest:
>>>>>
>>>>> my $node_species_sapiens = Bio::Taxonomy::Node->new(
>>>>>    -object_id => 9606, # or -ncbi_taxid. Required tag
>>>>>    -names => {
>>>>>        'scientific' => ['sapiens'],
>>>>>        'common_name' => ['human']
>>>>>    },
>>>>>    -rank => 'species'  # Required tag
>>>>> );
>>>>>
>>>>> and whilst Bio::Taxonomy::Node does not accept -names, it does
>>>>> have a 'name' method which claims to work like:
>>>>>
>>>>> $obj->name('scientific', 'sapiens');
>>>>>
>>>>> This kind of thing would be really nice, but afaics
>>>>> Bio::Taxonomy::Node->new takes the -name value and makes a common
>>>>> name out of it, whilst the name() method passes any 'scientific'
>>>>> name to the scientific_name() method which is unable to set any
>>>>> value (and warns about this), only get.
>>>>>
>>>>> It seems like the need to have this classification array work the
>>>>> same way as Bio::Species is causing some unnecessary restrictions.
>>>>> Can't the more sensible idea of having a dedicated storage spot for
>>>>> the ScientificName and other parameters be used, with the
>>>>> classification array either being generated just-in-time from the
>>>>> hash-stored data, or indeed being generated from the Lineage field?
>>>>>
>>>>>
>>>>> Also, why does a node store the complete hierarchy on itself in
>>>>> the classification array? If we're going that far, why don't the
>>>>> Bio::DB::Taxonomy modules like Bio::DB::Taxonomy::entrez just have
>>>>> a get_taxonomy() method instead of a get_Taxonomy_Node() method?
>>>>> get_taxonomy() could, from a single efetch.fcgi lookup, create a
>>>>> complete Bio::Taxonomy with all the nodes. Whilst most nodes would
>>>>> only have a minimum of information, if you could simply ask a node
>>>>> what its rank and scientific name was you could easily build a
>>>>> classification array, or ask what Kingdom your species was in etc.
>>>>>
>>>>> Are there good reasons for Taxonomy working the way it does in
>>>>> 1.5.1, or would I not be wasting my time re-writing things to make
>>>>> more sense (to me)?
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Sendu.
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>> --
>>>> Jason Stajich
>>>> Duke University
>>>> http://www.duke.edu/~jes12
>>>>
>>>>
>>>
>>
>> --
>> Jason Stajich
>> Duke University
>> http://www.duke.edu/~jes12
>>
>>
>

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12