[Bioperl-l] Bio::*Taxonomy* changes

Chris Fields cjfields at uiuc.edu
Mon Jul 24 19:49:06 UTC 2006


Yes, 'largely' is the key word.  I don't really agree with Sendu's hierarchy
scheme (making Species implement Taxonomy and not Node doesn't make sense),
but, besides that, everything else seems fine.  I like the following setup
(which is similar to what you proposed, I believe), which I already posted.

            |-----Tax::Node
NodeI-------|
            |-----Tax::SpeciesNode
                |
SpeciesI -------|

Taxonomy::Node is-a NodeI
Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI
Bio::Taxonomy 'has-a' NodeI-implementing module
SeqIO has-a SpeciesI-implementing module

Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
specifically, a SpeciesNode for species ranks or below, and a Node for
anything else.
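
To make the proposed split concrete, here is a rough sketch in plain Perl.
It is only an illustration of the is-a relationships and the factory rule
described above. The Sketch::* package names, the create_node() method, the
binomial() method and the list of species-level ranks are all invented for
this example and are not existing BioPerl code.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the proposed classes (not real BioPerl modules).

package Sketch::NodeI;        # interface: what every taxonomy node offers
sub rank            { die "rank() not implemented\n" }
sub scientific_name { die "scientific_name() not implemented\n" }

package Sketch::SpeciesI;     # interface: what SeqIO would need from a species
sub binomial { die "binomial() not implemented\n" }

package Sketch::Node;         # Node is-a NodeI
our @ISA = ('Sketch::NodeI');
sub new {
    my ($class, %args) = @_;
    return bless { rank => $args{-rank}, name => $args{-name} }, $class;
}
sub rank            { $_[0]{rank} }
sub scientific_name { $_[0]{name} }

package Sketch::SpeciesNode;  # SpeciesNode is-a NodeI and-a SpeciesI
our @ISA = ('Sketch::Node', 'Sketch::SpeciesI');
sub binomial { $_[0]->scientific_name }

package Sketch::Factory;
# Assumed set of ranks that count as "species or below".
my %SPECIES_LEVEL = map { $_ => 1 } qw(species subspecies varietas forma);

# Return a SpeciesNode for species-level ranks and a plain Node otherwise.
sub create_node {
    my ($class, %args) = @_;
    my $rank = lc( $args{-rank} // '' );
    my $impl = $SPECIES_LEVEL{$rank} ? 'Sketch::SpeciesNode' : 'Sketch::Node';
    return $impl->new(%args);
}

package main;

my $human = Sketch::Factory->create_node(-rank => 'species', -name => 'Homo sapiens');
my $order = Sketch::Factory->create_node(-rank => 'order',   -name => 'Primates');

print $human->scientific_name, " -> ", ref($human), "\n";
print $order->scientific_name, " -> ", ref($order), "\n";
```

With this layout, code that only needs a node can accept anything that
isa('Sketch::NodeI'), while SeqIO-style code can insist on
isa('Sketch::SpeciesI') and will never be handed a node above species rank.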

It would be nice to get this hammered out soon.  I think we can actually
start work on the Bio::Taxonomy::Node/SpeciesNode split; the interface
classes would be easy to add.  I could work on getting SeqIO to work with
Bio::Taxonomy::SpeciesNode when I can (sometime in the next few weeks).
As I mentioned before, I already have Bio::SeqIO::genbank using it but
won't commit it to CVS until we have sorted out the class hierarchy and
interface-implementation issues.

I won't be able to add too much more to this for a few weeks, unfortunately.
I need to prepare for a conference as well as finish up a ton of bench
research.  I'll try keeping up though...

Chris

> :-) I think we're largely in agreement. As for node_name() I fully
> understand the motivation, but it needs to be understood that the
> attribute's value will be based on a largely arbitrary choice unless
> it is set directly by the user.
> 
> 	-hilmar
> 
> On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote:
> 
> > Hilmar Lapp wrote:
> >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:
> >>
> >>> Bio::DB::Taxonomy::flatfile
> >>> ---------------------------
> >>> [...]
> >>>
> >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes
> >>> the division as a three letter code, like 'PRI'. However, for
> >>> consistency with entrez and the scientific_name() of the node the
> >>> division is supposed to correspond to, it is now stored as the full
> >>> name, like 'Primates'.
> >>
> >> What about adding a method division_code() which would return the
> >> 3-letter abbreviation?
> >>
> >> The abbreviation may be needed by flat-file writers, so it may be
> >> handy to have in some cases.
> >
> > As far as I know you can't get the 3-letter version via entrez, so no
> > other module can really expect to be able to get it, not knowing which
> > database (flatfile.pm or entrez.pm) the taxonomic information is
> > coming from.
> >
> > But of course it would be somewhat harmless to add division_code()
> > anyway. It might be better done as a -code => 1 option to division()?
> >
> >
> >>> The names->id solution also stores the artificially uniqued names
> >>> like 'Craniata <chordata>', allowing you for the first time to
> >>> retrieve the correct id. Previously the search would have simply
> >>> failed completely.
> >>>
> >>> The names->id solution now handles nodes with scientific names of
> >>> 'xyz (class)', allowing you to retrieve the id with both
> >>> get_taxonids('xyz') and get_taxonids('xyz (class)'). Previously only
> >>> the latter would work.
> >>
> >> Should angle brackets be allowed too?
> >
> > Allowed in what sense? You can indeed search for both
> > get_taxonids('Craniata <chordata>') [returns a single id] and
> > get_taxonids('Craniata') [returns multiple ids, one of which is the
> > previous answer].
> >
> >
> >> Maybe there should also be a -names parameter which accepts a hash
> >> reference with keys being the kind of name (scientific, common, etc)
> >> and the values being array references with the set of names of that
> >> kind?
> >
> > Not sure what you mean. name() has that data structure, though you're
> > not supposed to set its hash ref directly.
> >
> >
> >>> or the $node->classification() array.
> >>
> >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy
> >> brought over from a flawed (because flat) object model in
> >> Bio::Species.
> >
> > Yes, I agree.
> >
> >
> >>> NOTE: entrez modules (and website) cannot cope with '<something>' in
> >>> the query, failing searches like 'Craniata <chordata>'. For this
> >>> reason, if get_taxonids() is given a query with '<something>' it
> >>> will immediately return undefined, saving a pointless website access.
> >>
> >> If there is a 'next-best-thing' that is still semantically compatible
> >> with the API documentation, I would do that.
> >>
> >> In this case, if there is a <something> in the query the entrez
> >> module should strip it and automatically use the rest for searching.
> >> If indeed multiple IDs match there should be a warning to inform the
> >> user that entrez cannot use the <something> notation to limit the
> >> query results.
> >
> > I wouldn't like this. I actually had it working this way initially,
> > but decided that if someone entered 'xyz <something>' they really
> > didn't want multiple ids, expected to get multiple ids with just
> > 'xyz', and wouldn't want their query turned into something else and
> > then be warned about it.
> >
> >
> >> In fact, you might as well provide an option to enable an automatic
> >> check for the correct branch for each ID if multiple ones are
> >> returned. I.e., if this option is enabled, the module would
> >> automatically query the parent nodes to see if <something> is in the
> >> lineage, and if not will remove the respective ID from the result
> >> set. The reason you may want to make it optional is because it
> >> potentially costs time. (but in reality I'm not sure why a client
> >> will not want to enable the option - so maybe this should even be
> >> default)
> >
> > I can certainly add that, it seems like a good idea. I don't, however,
> > see any scope for an option at all. What would the option be called?
> > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless,
> > imho. If the user queries 'xyz <something>' with that option, they're
> > just going to have to do for themselves manually what the method would
> > have done for them without that option, in order to get the correct
> > answer. It'll be slower that way, if anything. So the option would
> > actually be called
> > -don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower
> > (!).
> >
> >
> >>> Bio::Taxonomy::Node
> >>> -------------------
> >>> [...]
> >>> classification() has a proper solution to finding the classification
> >>> when the array wasn't manually set.
> >>>
> >>> # Improvements
> >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to
> >>> name('common'). Now it is an alias to name('scientific').
> >>> NOTE: node_name is what is set when ->new(-name => $name) is called,
> >>> so flatfile and entrez and user-created nodes now implicitly
> >>> associate the name of the node they create with its scientific name.
> >>
> >> I'm not even sure node_name() should just be deprecated. The method
> >> falsely suggests that there is only a single and definitive name for
> >> the taxon node.
> >>
> >> In NCBI reality, this is only true for the scientific name of the
> >> node. In real reality, many nodes have multiple scientific names -
> >> taxonomy isn't static and therefore the scientific naming of nodes
> >> isn't either.
> >
> > For the programmer not using any database but just making up his own
> > nodes, I think he needs a node_name() because he may not be thinking
> > about anything fancy or realistic. He just wants to give his node a
> > single name that he invents. node_name() seems like the ideal method
> > name to me.
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> 
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================



