[Bioperl-l] Bio::Taxonomy confusion

Thu May 11 14:57:20 UTC 2006

Heh... 

To tell the truth, I haven't looked at Bio::DB::Taxonomy in any depth yet,
but I myself have seen issues with the way Bio::Species treats bacterial
strains (I guess this also involves Bio::Taxonomy::Node since that's what
Bio::Species delegates to).  Seems it likes to repeat some strain names when
using $seq->species->common_name.  Not a killer problem but annoying since
the correct name is in the source tag in the feature table!  I 'could' take
a look at it but I can't guarantee quick results.

Jason, I could add Taxonomy to the EUtilities overhaul I mentioned to you
previously, but it'll take a while to get going.  I'm really more interested
in getting epost-esearch-efetch sequence retrieval up and running first with
the same API as Bio::DB::GenBank/GenPept and Bio::DB::Query::GenBank, then
donating the code (late summer/fall???) after working out namespace issues so
it doesn't conflict with the current Bio::DB::WebDBSeqI inheritance.  I suppose
I could also look at Bio::DB::Taxonomy to see what's up in the next couple of
weeks (after the conference), unless someone gets to it sooner.
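
As a rough illustration of what the rank-keyed efetch XML quoted below
makes possible, here is a minimal sketch that reads one record into
(taxid, name, rank) entries, then derives a classification list and a
lookup by rank from them.  It is plain Python with only the standard
library, not BioPerl code; the helper names parse_lineage and
name_at_rank are made up for this example, and the XML is a trimmed copy
of the Frankia record with only three lineage entries kept.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the efetch record quoted below (three lineage entries kept).
TAXON_XML = """<TaxaSet>
<Taxon>
  <TaxId>106370</TaxId>
  <ScientificName>Frankia sp. CcI3</ScientificName>
  <ParentTaxId>1854</ParentTaxId>
  <Rank>species</Rank>
  <LineageEx>
    <Taxon>
      <TaxId>2</TaxId>
      <ScientificName>Bacteria</ScientificName>
      <Rank>superkingdom</Rank>
    </Taxon>
    <Taxon>
      <TaxId>74712</TaxId>
      <ScientificName>Frankiaceae</ScientificName>
      <Rank>family</Rank>
    </Taxon>
    <Taxon>
      <TaxId>1854</TaxId>
      <ScientificName>Frankia</ScientificName>
      <Rank>genus</Rank>
    </Taxon>
  </LineageEx>
</Taxon>
</TaxaSet>"""


def parse_lineage(xml_text):
    """Return (taxid, name, rank) tuples: ancestors in document order,
    followed by the queried taxon itself as the last entry."""
    taxon = ET.fromstring(xml_text).find("Taxon")

    def fields(node):
        # findtext() only looks at direct children, so the nested
        # LineageEx entries do not shadow the top-level values.
        return (
            int(node.findtext("TaxId")),
            node.findtext("ScientificName"),
            node.findtext("Rank"),
        )

    ancestors = [fields(t) for t in taxon.findall("LineageEx/Taxon")]
    return ancestors + [fields(taxon)]


def name_at_rank(lineage, rank):
    """Return the scientific name stored at the given rank, or None."""
    for _taxid, name, node_rank in lineage:
        if node_rank == rank:
            return name
    return None


lineage = parse_lineage(TAXON_XML)
# Most specific name first, i.e. scientific name at the start of the list.
classification = [name for _taxid, name, _rank in reversed(lineage)]
```

Because the depth of the lineage varies from taxon to taxon, looking a
name up by its rank (and getting None when that rank is absent) avoids
assuming a fixed number of classification slots.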

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Jason Stajich
> Sent: Thursday, May 11, 2006 7:05 AM
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org; 'Sendu Bala'
> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
> 
> Great - now we just need someone to volunteer to actually work on this.
> 
> The current code grabs most of this but I believe expects a different XML
> 
> 
> On May 10, 2006, at 11:36 PM, Chris Fields wrote:
> 
> > I think you can get pretty much everything now, though I can definitely
> > see the use of a local database.  I ran a few tests, really unrelated to
> > this, using the powerscripting test page at NCBI for eutils (for the
> > curious, at http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/eu.cgi) and
> > was able to retrieve XML-formatted taxonomic information; here's the
> > bacterium Frankia sp. CcI3 TaxID info, which looks like they have
> > everything set up by rank.  It gives quite a bit of information.
> >
> > <?xml version="1.0"?>
> > <!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN"
> > "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
> > <TaxaSet>
> >
> > <Taxon>
> >   <TaxId>106370</TaxId>
> >   <ScientificName>Frankia sp. CcI3</ScientificName>
> >   <ParentTaxId>1854</ParentTaxId>
> >   <Rank>species</Rank>
> >   <Division>Bacteria</Division>
> >   <GeneticCode>
> >     <GCId>11</GCId>
> >     <GCName>Bacterial and Plant Plastid</GCName>
> >   </GeneticCode>
> >   <MitoGeneticCode>
> >     <MGCId>0</MGCId>
> >     <MGCName>Unspecified</MGCName>
> >   </MitoGeneticCode>
> >   <Lineage>cellular organisms; Bacteria; Actinobacteria; Actinobacteria (class); Actinobacteridae; Actinomycetales; Frankineae; Frankiaceae; Frankia</Lineage>
> >   <LineageEx>
> >     <Taxon>
> >       <TaxId>131567</TaxId>
> >       <ScientificName>cellular organisms</ScientificName>
> >       <Rank>no rank</Rank>
> >     </Taxon>
> >     <Taxon>
> >       <TaxId>2</TaxId>
> >       <ScientificName>Bacteria</ScientificName>
> >       <Rank>superkingdom</Rank>
> >     </Taxon>
> >     <Taxon>
> >       <TaxId>201174</TaxId>
> >       <ScientificName>Actinobacteria</ScientificName>
> >       <Rank>phylum</Rank>
> >     </Taxon>
> >     <Taxon>
> >       <TaxId>1760</TaxId>
> >       <ScientificName>Actinobacteria (class)</ScientificName>
> >       <Rank>class</Rank>
> >     </Taxon>
> >     <Taxon>
> >       <TaxId>85003</TaxId>
> >       <ScientificName>Actinobacteridae</ScientificName>
> >       <Rank>subclass</Rank>
> >     </Taxon>
> >     <Taxon>
> >       <TaxId>2037</TaxId>
> >       <ScientificName>Actinomycetales</ScientificName>
> >       <Rank>order</Rank>
> >     </Taxon>
> >     <Taxon>
> >       <TaxId>85013</TaxId>
> >       <ScientificName>Frankineae</ScientificName>
> >       <Rank>suborder</Rank>
> >     </Taxon>
> >     <Taxon>
> >       <TaxId>74712</TaxId>
> >       <ScientificName>Frankiaceae</ScientificName>
> >       <Rank>family</Rank>
> >     </Taxon>
> >     <Taxon>
> >       <TaxId>1854</TaxId>
> >       <ScientificName>Frankia</ScientificName>
> >       <Rank>genus</Rank>
> >     </Taxon>
> >   </LineageEx>
> >   <CreateDate>1999/10/22</CreateDate>
> >   <UpdateDate>2005/01/19</UpdateDate>
> >   <PubDate>2000/02/02</PubDate>
> > </Taxon>
> > </TaxaSet>
> >
> >
> > Chris
> >
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> >> bounces at lists.open-bio.org] On Behalf Of Jason Stajich
> >> Sent: Wednesday, May 10, 2006 7:54 PM
> >> To: Sendu Bala
> >> Cc: bioperl-l at lists.open-bio.org
> >> Subject: Re: [Bioperl-l] Bio::Taxonomy confusion
> >>
> >> I would use the implementation that talks to the flatfile db as the
> >> standard here.  Nodes are defined by the data from the taxonomy dump
> >> dbs from NCBI.
> >> Eutils is pretty worthless except for taxid->name or the reverse; you
> >> can't get the full taxonomy (or couldn't when that implementation was
> >> written).
> >>
> >> The "name" method refers to the name of the node - each level in the
> >> taxonomy can have a "name".
> >>
> >> The bits of hackiness relate to wrapping the node object as a
> >> Bio::Species and/or being able to read a genbank file, take the
> >> organism taxonomy data as a list, and instantiate from it.  If we could
> >> rely on everything being in a DB of course this would be simpler.
> >>
> >> Another problem is the depth of the taxonomy is not constant for
> >> every node so assuming that a fixed number of slots will be filled in
> >> to generate the taxonomy leads to problems.
> >>
> >> Use the flatfile implementation (Bio::DB::Taxonomy::flatfile) as the
> >> best example of working code, as this is how I really wanted it to
> >> work; the Bio::Species hacks are only there to shoehorn in data
> >> retrieved from genbank files.  With the flatfile implementation
> >> you have to walk all the way up the db hierarchy to get the kingdom
> >> for a node, so you do have to build up the classification hierarchy, as
> >> each node only stores data about itself.
> >>
> >> I'm not exactly sure what you are proposing to do, but would
> >> definitely enjoy another pair of hands, I don't really have time to
> >> mess with it any time soon.
> >>
> >> -jason
> >> On May 10, 2006, at 5:30 AM, Sendu Bala wrote:
> >>
> >>> Hi,
> >>> I'm a little confused as to how names are supposed to work in
> >>> Bio::Taxonomy::Node.
> >>>
> >>> In the bioperl versions that I've looked at, a Node doesn't seem to
> >>> store the most important information about itself - its scientific
> >>> name - in an obvious place. bioperl 1.5.1 puts it at the start of the
> >>> classification list. I'd have thought sticking it in -name would make
> >>> more sense, but this is used only for the GenBank common name.
> >>>
> >>> The Bio::Taxonomy docs still suggest:
> >>>
> >>> my $node_species_sapiens = Bio::Taxonomy::Node->new(
> >>>    -object_id => 9606, # or -ncbi_taxid. Required tag
> >>>    -names => {
> >>>        'scientific' => ['sapiens'],
> >>>        'common_name' => ['human']
> >>>    },
> >>>    -rank => 'species'  # Required tag
> >>> );
> >>>
> >>> and whilst Bio::Taxonomy::Node does not accept -names, it does have a
> >>> 'name' method which claims to work like:
> >>>
> >>> $obj->name('scientific', 'sapiens');
> >>>
> >>> This kind of thing would be really nice, but afaics
> >>> Bio::Taxonomy::Node->new takes the -name value and makes a common name
> >>> out of it, whilst the name() method passes any 'scientific' name to the
> >>> scientific_name() method, which is unable to set any value (and warns
> >>> about this), only get.
> >>>
> >>> It seems like the need to have this classification array work the same
> >>> way as Bio::Species is causing some unnecessary restrictions. Can't the
> >>> more sensible idea of having a dedicated storage spot for the
> >>> ScientificName and other parameters be used, with the classification
> >>> array either being generated just-in-time from the hash-stored data, or
> >>> indeed being generated from the Lineage field?
> >>>
> >>>
> >>> Also, why does a node store the complete hierarchy on itself in the
> >>> classification array? If we're going that far, why don't the
> >>> Bio::DB::Taxonomy modules like Bio::DB::Taxonomy::entrez just have a
> >>> get_taxonomy() method instead of a get_Taxonomy_Node() method?
> >>> get_taxonomy() could, from a single efetch.fcgi lookup, create a
> >>> complete Bio::Taxonomy with all the nodes. Whilst most nodes would only
> >>> have a minimum of information, if you could simply ask a node what its
> >>> rank and scientific name were you could easily build a classification
> >>> array, or ask what Kingdom your species was in etc.
> >>>
> >>> Are there good reasons for Taxonomy working the way it does in 1.5.1, or
> >>> would I not be wasting my time re-writing things to make more sense
> >>> (to me)?
> >>>
> >>>
> >>> Cheers,
> >>> Sendu.
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >> --
> >> Jason Stajich
> >> Duke University
> >> http://www.duke.edu/~jes12
> >>
> >
> 
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12
> 