[Bioperl-l] Bio::Taxonomy confusion
Jason Stajich
jason.stajich at duke.edu
Thu May 11 15:42:07 UTC 2006
I think you'll see it is different, and that this is mostly a
limitation of the genbank format: the Bio::Species objects that you get
from a genbank parse do not represent the full capabilities of a
Taxonomy::Node.
I am happy for someone to overhaul things, but it all boils down to
inferring which part of a list of names is the species versus sub-
species versus strain when none of the members of the list are
labeled. These are some of the same problems we have for swissprot as
well. I just don't think we can do it right from the genbank file data
alone, so I don't see much point in expecting Bio::Species to provide
more than a representation of what is in the file and just return that
array.
It has seemed like we need to special case things pretty heavily or
do a lookup in the taxonomydb for something.
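As a rough sketch of the lookup route (plain Python rather than any
BioPerl API; ranked_lineage is a hypothetical helper): if the
rank-labelled XML that Chris quotes further down this thread is
available for a taxid, genus and species can be read off by rank
instead of guessed from word position. The XML here is a trimmed copy
of that record.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the efetch taxonomy record quoted later in the thread.
XML = """<TaxaSet>
<Taxon>
  <TaxId>106370</TaxId>
  <ScientificName>Frankia sp. CcI3</ScientificName>
  <Rank>species</Rank>
  <LineageEx>
    <Taxon><TaxId>2</TaxId><ScientificName>Bacteria</ScientificName><Rank>superkingdom</Rank></Taxon>
    <Taxon><TaxId>74712</TaxId><ScientificName>Frankiaceae</ScientificName><Rank>family</Rank></Taxon>
    <Taxon><TaxId>1854</TaxId><ScientificName>Frankia</ScientificName><Rank>genus</Rank></Taxon>
  </LineageEx>
</Taxon>
</TaxaSet>"""


def ranked_lineage(xml_text):
    """Return (rank, scientific name) pairs, ancestors first, the taxon itself last."""
    taxon = ET.fromstring(xml_text).find("Taxon")
    pairs = [
        (t.findtext("Rank"), t.findtext("ScientificName"))
        for t in taxon.findall("LineageEx/Taxon")
    ]
    pairs.append((taxon.findtext("Rank"), taxon.findtext("ScientificName")))
    return pairs


lineage = ranked_lineage(XML)
# "no rank" entries are not unique, so they are left out of the rank index.
by_rank = {rank: name for rank, name in lineage if rank != "no rank"}
print(by_rank["genus"])    # Frankia
print(by_rank["species"])  # Frankia sp. CcI3
```

Nothing here depends on how many levels the lineage has, which matters
because the depth is not the same for every node.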
Can you guess what value is the strain versus sub-species? What
happens when there is a two part strain name (space separated) and a
sub-species or variety designation?
SOURCE      Staphylococcus haemolyticus JCSC1435
  ORGANISM  Staphylococcus haemolyticus JCSC1435
            Bacteria; Firmicutes; Bacillales; Staphylococcus.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=279808
strain is JCSC1435
versus
SOURCE      Muntiacus muntjak vaginalis
  ORGANISM  Muntiacus muntjak vaginalis
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
            Pecora; Cervidae; Muntiacinae; Muntiacus.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9887
species is muntjak, sub-species vaginalis ?
versus
SOURCE      Aspergillus nidulans FGSC A4
  ORGANISM  Aspergillus nidulans FGSC A4
            Eukaryota; Fungi; Ascomycota; Pezizomycotina; Eurotiomycetes;
            Eurotiales; Trichocomaceae; Emericella.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=227321
Genus should be Aspergillus or Emericella ?
Strain and subspecies/variety in the same entry
SOURCE      Cryptococcus neoformans var. grubii H99
  ORGANISM  Cryptococcus neoformans var. grubii H99
            Eukaryota; Fungi; Basidiomycota; Hymenomycetes;
            Heterobasidiomycetes; Tremellomycetidae; Tremellales;
            Tremellaceae; Filobasidiella.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=235443
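
To make the ambiguity in the four records above concrete, here is a
minimal sketch (plain Python, not BioPerl; guess_parts is a
hypothetical helper) of the obvious position-based split. It assumes
the first word is the genus and the second the species, and has no way
to label whatever is left over.

```python
def guess_parts(organism):
    """Naive split of an unlabeled ORGANISM line by word position."""
    words = organism.split()
    # Assumption: word 1 is the genus, word 2 the species epithet.
    # Everything after that is unlabeled: strain? sub-species? variety?
    return {"genus": words[0], "species": words[1], "rest": words[2:]}


examples = [
    "Staphylococcus haemolyticus JCSC1435",      # rest is a strain
    "Muntiacus muntjak vaginalis",               # rest is a sub-species
    "Aspergillus nidulans FGSC A4",              # rest is a two-word strain
    "Cryptococcus neoformans var. grubii H99",   # rest is variety plus strain
]

for name in examples:
    parts = guess_parts(name)
    print(parts["genus"], "|", parts["species"], "|", parts["rest"])
```

The comments on the right come from the NCBI taxonomy pages, not from
the strings themselves, which is the whole problem: a one-word
remainder can be either a strain or a sub-species, and the genus guess
for the Aspergillus record does not match the last entry of its lineage
(Emericella).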
On May 11, 2006, at 10:57 AM, Chris Fields wrote:
> Heh...
>
> To tell the truth, I haven't looked at Bio::DB::Taxonomy in any
> depth yet,
> but I myself have seen issues with the way Bio::Species treats
> bacterial
> strains (I guess this also involves Bio::Taxonomy::Node since
> that's what
> Bio::Species delegates to). Seems it likes to repeat some strain
> names when
> using $seq->species->common_name. Not a killer problem but
> annoying since
> the correct name is in the source tag in the feature table! I
> 'could' take
> a look at it but I can't guarantee quick results.
>
> Jason, I could add Taxonomy to the EUtilities overhaul I mentioned
> to you
> previously but it'll take a while to get going. I'm really more
> interested
> in getting epost-esearch-efetch sequence retrieval up and running
> first with
> the same API as Bio::DB::GenBank/Genpept and
> Bio::DB::Query::GenBank, then donate
> the code (late summer/fall???) after working out namespace issues
> so it
> doesn't conflict with current Bio::DB::WebDBSeqI inheritance. I
> suppose I
> could also look at Bio::DB::Taxonomy to see what's up in the next
> couple of
> weeks (after conference), unless someone gets to it sooner.
>
> Chris
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Jason Stajich
>> Sent: Thursday, May 11, 2006 7:05 AM
>> To: Chris Fields
>> Cc: bioperl-l at lists.open-bio.org; 'Sendu Bala'
>> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
>>
>> Great - now we just need someone to volunteer to actually work on
>> this.
>>
>> The current code grabs most of this but I believe expects a different
>> XML
>>
>>
>> On May 10, 2006, at 11:36 PM, Chris Fields wrote:
>>
>>> I think you can get pretty much everything now, though I can
>>> definitely see
>>> the use of a local database. I ran a few tests, really unrelated
>>> to this,
>>> using the powerscripting test page at NCBI for eutils (for the
>>> curious, at
>>> http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/eu.cgi) and was
>>> able to
>>> retrieve XML-formatted taxonomic information; here's the bacterium
>>> Frankia
>>> sp. CcI3 TaxID info, which looks like they have everything set up
>>> by rank.
>>> It gives quite a bit of information.
>>>
>>> <?xml version="1.0"?>
>>> <!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN"
>>>   "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
>>> <TaxaSet>
>>>   <Taxon>
>>>     <TaxId>106370</TaxId>
>>>     <ScientificName>Frankia sp. CcI3</ScientificName>
>>>     <ParentTaxId>1854</ParentTaxId>
>>>     <Rank>species</Rank>
>>>     <Division>Bacteria</Division>
>>>     <GeneticCode>
>>>       <GCId>11</GCId>
>>>       <GCName>Bacterial and Plant Plastid</GCName>
>>>     </GeneticCode>
>>>     <MitoGeneticCode>
>>>       <MGCId>0</MGCId>
>>>       <MGCName>Unspecified</MGCName>
>>>     </MitoGeneticCode>
>>>     <Lineage>cellular organisms; Bacteria; Actinobacteria; Actinobacteria (class); Actinobacteridae; Actinomycetales; Frankineae; Frankiaceae; Frankia</Lineage>
>>>     <LineageEx>
>>>       <Taxon>
>>>         <TaxId>131567</TaxId>
>>>         <ScientificName>cellular organisms</ScientificName>
>>>         <Rank>no rank</Rank>
>>>       </Taxon>
>>>       <Taxon>
>>>         <TaxId>2</TaxId>
>>>         <ScientificName>Bacteria</ScientificName>
>>>         <Rank>superkingdom</Rank>
>>>       </Taxon>
>>>       <Taxon>
>>>         <TaxId>201174</TaxId>
>>>         <ScientificName>Actinobacteria</ScientificName>
>>>         <Rank>phylum</Rank>
>>>       </Taxon>
>>>       <Taxon>
>>>         <TaxId>1760</TaxId>
>>>         <ScientificName>Actinobacteria (class)</ScientificName>
>>>         <Rank>class</Rank>
>>>       </Taxon>
>>>       <Taxon>
>>>         <TaxId>85003</TaxId>
>>>         <ScientificName>Actinobacteridae</ScientificName>
>>>         <Rank>subclass</Rank>
>>>       </Taxon>
>>>       <Taxon>
>>>         <TaxId>2037</TaxId>
>>>         <ScientificName>Actinomycetales</ScientificName>
>>>         <Rank>order</Rank>
>>>       </Taxon>
>>>       <Taxon>
>>>         <TaxId>85013</TaxId>
>>>         <ScientificName>Frankineae</ScientificName>
>>>         <Rank>suborder</Rank>
>>>       </Taxon>
>>>       <Taxon>
>>>         <TaxId>74712</TaxId>
>>>         <ScientificName>Frankiaceae</ScientificName>
>>>         <Rank>family</Rank>
>>>       </Taxon>
>>>       <Taxon>
>>>         <TaxId>1854</TaxId>
>>>         <ScientificName>Frankia</ScientificName>
>>>         <Rank>genus</Rank>
>>>       </Taxon>
>>>     </LineageEx>
>>>     <CreateDate>1999/10/22</CreateDate>
>>>     <UpdateDate>2005/01/19</UpdateDate>
>>>     <PubDate>2000/02/02</PubDate>
>>>   </Taxon>
>>> </TaxaSet>
>>>
>>>
>>> Chris
>>>
>>>> -----Original Message-----
>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>> bounces at lists.open-bio.org] On Behalf Of Jason Stajich
>>>> Sent: Wednesday, May 10, 2006 7:54 PM
>>>> To: Sendu Bala
>>>> Cc: bioperl-l at lists.open-bio.org
>>>> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
>>>>
>>>> I would use the implementation that talks to the flatfile db as the
>>>> standard here. Nodes are defined by the data from the taxonomy dump
>>>> dbs from ncbi.
>>>> The eutils is pretty worthless except for taxid->name or the
>>>> reverse; you can't get the full taxonomy (or couldn't when that
>>>> implementation was written).
>>>>
>>>> The "name" method refers to the name of the node - each level in
>>>> the
>>>> taxonomy can have a "name".
>>>>
>>>> The bits of hackiness relate to wrapping the node object as a
>>>> Bio::Species and/or being able to read a genbank file and the
>>>> organism taxonomy data as a list and instantiating. If we could
>>>> rely
>>>> on everything being in a DB of course this would be simpler.
>>>>
>>>> Another problem is the depth of the taxonomy is not constant for
>>>> every node so assuming that a fixed number of slots will be
>>>> filled in
>>>> to generate the taxonomy leads to problems.
>>>>
>>>> Use the flatfile implementation (Bio::DB::Taxonomy::flatfile) as
>>>> the
>>>> best example of working code as this is how I really wanted it to
>>>> work, the Bio::Species hacks are only there to shoehorn data
>>>> retrieved from genbank files in. With the flatfile implementation
>>>> you have to walk all the way up the db hierarchy to get the kingdom
>>>> for a node so you do have to build up the classification
>>>> hierarchy as
>>>> each node only stores data about itself.
>>>>
>>>> I'm not exactly sure what you are proposing to do, but would
>>>> definitely enjoy another pair of hands, I don't really have time to
>>>> mess with it any time soon.
>>>>
>>>> -jason
>>>> On May 10, 2006, at 5:30 AM, Sendu Bala wrote:
>>>>
>>>>> Hi,
>>>>> I'm a little confused as to how names are supposed to work in
>>>>> Bio::Taxonomy::Node.
>>>>>
>>>>> In the bioperl versions that I've looked at a Node doesn't seem to
>>>>> store
>>>>> the most important information about itself - its scientific name
>>>>> - in
>>>>> an obvious place. bioperl 1.5.1 puts it at the start of the
>>>>> classification list. I'd have thought sticking it in -name would
>>>>> make
>>>>> more sense, but this is used only for the GenBank common name.
>>>>>
>>>>> The Bio::Taxonomy docs still suggest:
>>>>>
>>>>> my $node_species_sapiens = Bio::Taxonomy::Node->new(
>>>>>     -object_id => 9606,  # or -ncbi_taxid. Required tag
>>>>>     -names     => {
>>>>>         'scientific'  => ['sapiens'],
>>>>>         'common_name' => ['human']
>>>>>     },
>>>>>     -rank      => 'species'  # Required tag
>>>>> );
>>>>>
>>>>> and whilst Bio::Taxonomy::Node does not accept -names, it does
>>>>> have a
>>>>> 'name' method which claims to work like:
>>>>>
>>>>> $obj->name('scientific', 'sapiens');
>>>>>
>>>>> This kind of thing would be really nice, but afaics
>>>>> Bio::Taxonomy::Node->new takes the -name value and makes a common
>>>>> name
>>>>> out of it, whilst the name() method passes any 'scientific'
>>>>> name to
>>>>> the
>>>>> scientific_name() method which is unable to set any value (and
>>>>> warns
>>>>> about this), only get.
>>>>>
>>>>> It seems like the need to have this classification array work the
>>>>> same
>>>>> way as Bio::Species is causing some unnecessary restrictions.
>>>>> Can't
>>>>> the
>>>>> more sensible idea of having a dedicated storage spot for the
>>>>> ScientificName and other parameters be used, with the
>>>>> classification
>>>>> array either being generated just-in-time from the hash-stored
>>>>> data, or
>>>>> indeed being generated from the Lineage field?
>>>>>
>>>>>
>>>>> Also, why does a node store the complete hierarchy on itself in
>>>>> the
>>>>> classification array? If we're going that far, why don't the
>>>>> Bio::DB::Taxonomy modules like Bio::DB::Taxonomy::entrez just
>>>>> have a
>>>>> get_taxonomy() method instead of a get_Taxonomy_Node() method?
>>>>> get_taxonomy() could, from a single efetch.fcgi lookup, create a
>>>>> complete Bio::Taxonomy with all the nodes. Whilst most nodes would
>>>>> only
>>>>> have a minimum of information, if you could simply ask a node
>>>>> what its
>>>>> rank and scientific name were you could easily build a
>>>>> classification
>>>>> array, or ask what Kingdom your species was in etc.
>>>>>
>>>>> Are there good reasons for Taxonomy working the way it does in
>>>>> 1.5.1, or
>>>>> would I not be wasting my time re-writing things to make more
>>>>> sense
>>>>> (to me)?
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Sendu.
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>> --
>>>> Jason Stajich
>>>> Duke University
>>>> http://www.duke.edu/~jes12
>>>
>>
>> --
>> Jason Stajich
>> Duke University
>> http://www.duke.edu/~jes12
>
--
Jason Stajich
Duke University
http://www.duke.edu/~jes12