[Bioperl-l] Bio::Taxonomy changes

Tue Jul 25 18:24:12 UTC 2006

Sendu, you'll have to make the changes how you see fit.  You see my point
now, which is great.  

From my perspective, all the object type (used to contain taxonomy file
information) needs to contain is the scientific name and common names like
the SOURCE line abbreviated name and the actual GenBank common name, if
present.  All the other cruft (i.e. genus/species/subspecies) can be
excised, and the proper taxonomic information, if wanted, could be accessed
via the object and its TaxID.  Organelle and lineage information needs to
be retained (for the non-taxonomists) and could be stored in that object,
bumped to SimpleValue objects, or just set (alternative, since the data is
small) using a get/set value within the sequence object itself.  This would
be the bare-bones approach, which Node can fulfill.

I also like Hilmar's proposal about including optional lookups, which
greatly increases the flexibility when screening sequences.  This will
likely require a more complicated object structure (i.e. taxonomy with
nodes).  You suggested a Taxonomy-like object, which would work, but don't
force Bio::Species into the mix.  Why not just use a simple Bio::Taxonomy
object for that (Hilmar's point)?

When one asks for $species->species, they'll get a Node or Taxonomy,
whichever is used (that's up to you).  The Node represents a more-barebones
variation, while the Taxonomy object scheme would be more fully-realized.
Either way will work for me.  Just don't call it 'species'.  ; >

Once this is all done, will we really have a need for Bio::Species?  That's
my other point.  The only real use for it was as a container object for
sequence data.  That job is now done via a Taxonomy/Node object.  The only
real use it would have is as a container for taxonomic information for
species ranks or below.  I think Node/Taxonomy can handle even that though,
so now it's also redundant.  If a class is not useful and is redundant,
maybe it should be deprecated.

Anyway, I can't get involved anymore at this point; I'm too busy with
getting ready for the Kadner Institute next week.  Good luck!

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Tuesday, July 25, 2006 12:49 PM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
> 
> Chris Fields wrote:
> > If I were to get an object back that was labeled Bio::Species, as a
> > biologist I would expect it to be part of a taxonomy, not the actual
> > Taxonomy itself.
> 
> I think this is the most important sentence in the discussion. Ok, so
> it's clear to me that a better solution is needed than my
> Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I
> also needed to start trying to code my Taxonomy proposal to see some
> issues with it.
> 
> 
> [... in another email...]
> > I'm trying to view this as an outsider would,
> > a biologist not familiar with the Bioperl class structure.
> 
> Ok, let's come up with a proposal that makes sense to the biologist and
> better matches Jason's original idea.
> 
> ---- long post follows; there's a summary at the end
> 
> As a biologist when I consider a species I have the following primary
> questions. Let's see how we would answer them using a) Bio::Species and
> genbank.pm as they are now, b) Bio::Species if it was a 'pure'
> Bio::Taxonomy::Node with no cruft (or if we just dropped Bio::Species
> and used Node directly), and Chris' updated genbank.pm. Let's say we got
> our species information from a genbank file where the scientific name
> and tax id are available to be parsed out.
> 
> # What is the species' name?
> a) Not guaranteed to be correct.
> b) Correct thanks to recent changes to Node, just use scientific_name()
> 
> 
> # What is the lineage of this species?
> a) I can get a classification array with classification(). It's a bit
> rubbish though, I can't tell what any of the array elements are supposed
> to be.
> b) A pure Node wouldn't store the lineage on itself. There are two
> obvious solutions: 1) add cruft to Node by giving it a classification()
> method - works as well/bad as a). 2) call get_Lineage_Nodes(), which has
> the benefit of telling me what rank each ancestor was, if that
> information had been in the file (more likely, if Node was generated
> from database). Problem: get_Lineage_Nodes() only works if it can
> $self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id);
> which obviously doesn't work if the nodes in our lineage didn't come
> from a database, but from the parsing of a genbank flat file. As we
> parse the genbank file we can certainly make nodes for each word in the
> list:
> inside genbank.pm...
> @class = reverse @class;
> my @nodes; my $fake_id = 1;
> foreach my $sci_name (@class) {
>    # each node's parent is the next name in the list
>    push(@nodes, Bio::Taxonomy::Node->new(-name      => $sci_name,
>                                          -object_id => $fake_id,
>                                          -parent_id => $fake_id + 1));
>    $fake_id++;
> }
> But how do we keep these nodes and make them returnable later by
> get_Lineage_Nodes? Perhaps:
> my $taxonomy = new Bio::Taxonomy;
> foreach my $node (@nodes) {
>    $taxonomy->add_node($node);
> }
> ...
> my $make = Bio::Taxonomy::Node->new();
> ...
> $make->db_handle($taxonomy);
> Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node
> which only accepts a rank). Of course this is ugly, storing a Taxonomy
> in our database handle. We could have a new Bio::DB::Taxonomy:: class
> instead, that treated a classification array like a database? It could
> have the added bonus of building up an entire database internally as
> more input arrays are given to it, able to therefore give each node a
> unique but consistent id. It would break if one time you gave it qw(Homo
> Primates) and another time qw(Homo Hominidae Primates), however. Ideas?
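
A minimal sketch of the "classification array as a database" idea above, in plain Perl with no BioPerl dependency. The package name ListTaxonomyDB and the method add_lineage are hypothetical, and get_Taxonomy_Node here only loosely mirrors the call quoted above. It assumes each lineage is ordered from most specific to most general, and it simply dies on the inconsistent-lineage case rather than trying to resolve it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Bio::DB::Taxonomy-style class; not BioPerl code.
package ListTaxonomyDB;

sub new {
    my ($class) = @_;
    return bless { next_id => 1, id_of => {}, node => {} }, $class;
}

# Accepts a lineage ordered from most specific to most general,
# e.g. qw(Homo Hominidae Primates). Returns the id of the first name.
sub add_lineage {
    my ($self, @names) = @_;
    my @ids = map { $self->_id_for($_) } @names;
    for my $i (0 .. $#names) {
        # The last name of a lineage carries no parent information.
        next unless $i < $#ids;
        my $parent = $ids[$i + 1];
        my $node   = $self->{node}{ $ids[$i] };
        if (defined $node->{parent_id} && $node->{parent_id} != $parent) {
            die "conflicting parent for '$names[$i]'\n";
        }
        $node->{parent_id} = $parent;
    }
    return $ids[0];
}

# Same name always maps to the same id, so ids stay consistent across inputs.
sub _id_for {
    my ($self, $name) = @_;
    unless (exists $self->{id_of}{$name}) {
        my $id = $self->{next_id}++;
        $self->{id_of}{$name} = $id;
        $self->{node}{$id} = { id => $id, name => $name, parent_id => undef };
    }
    return $self->{id_of}{$name};
}

# Look a node up by the id this object assigned to it.
sub get_Taxonomy_Node {
    my ($self, %args) = @_;
    return $self->{node}{ $args{-taxonid} };
}

package main;

my $db   = ListTaxonomyDB->new;
my $homo = $db->add_lineage(qw(Homo Hominidae Primates));
my $pan  = $db->add_lineage(qw(Pan Hominidae Primates));

# Both genera now point at the same Hominidae node.
my $homo_parent = $db->get_Taxonomy_Node(-taxonid => $homo)->{parent_id};
my $pan_parent  = $db->get_Taxonomy_Node(-taxonid => $pan)->{parent_id};

# A later, shorter lineage for Homo contradicts the stored parent.
my $ok = eval { $db->add_lineage(qw(Homo Primates)); 1 };
```

Keying ids on the name alone is the simplest choice and is exactly what makes the qw(Homo Primates) case fail; a real implementation would need some policy for merging or rejecting such input.
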
> 
> 
> # What if I don't want the whole lineage, just to know what a specific
> rank like genus is for my species?
> a) use genus(), but not guaranteed to be correct.
> b) two solutions: 1) add cruft to Node by adding a genus() method: as
> good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until
> you find a node with your rank() of interest. Same problems as for
> lineage question, but also it would be nicer to have a
> get_node('rank_name') style method. But such a method belongs in
> something like Bio::Taxonomy, not Node. At the very least a method like
> genus() would be implemented using pure Node methods like
> get_Parent_Node(), returning undefined if no parent had a rank() of
> 'genus', never guessing it.
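
The "walk up until you find the rank, otherwise return undefined" behaviour described above could look roughly like this. It is a sketch only: the nodes are plain hashes with a direct parent reference, and make_node and ancestor_of_rank are hypothetical names. In BioPerl the parent would come from get_Parent_Node() and the rank from rank().

```perl
use strict;
use warnings;

# Hypothetical minimal node: a hash with a name, a rank and a parent reference.
sub make_node {
    my ($name, $rank, $parent) = @_;
    return { name => $name, rank => $rank, parent => $parent };
}

# Walk up from $node until an ancestor of the requested rank is found.
# Returns undef (in scalar context) when no ancestor has that rank;
# nothing is guessed from the name.
sub ancestor_of_rank {
    my ($node, $rank) = @_;
    for (my $cur = $node->{parent}; defined $cur; $cur = $cur->{parent}) {
        return $cur if defined $cur->{rank} && $cur->{rank} eq $rank;
    }
    return;
}

my $primates  = make_node('Primates',     'order',   undef);
my $hominidae = make_node('Hominidae',    'family',  $primates);
my $homo      = make_node('Homo',         'genus',   $hominidae);
my $sapiens   = make_node('Homo sapiens', 'species', $homo);

my $genus   = ancestor_of_rank($sapiens, 'genus');
my $kingdom = ancestor_of_rank($sapiens, 'kingdom');   # not in this lineage
```
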
> 
> 
> # Is this species the same as another species?
> a) Not guaranteed to be correct. (no unique id so forced to compare names)
> b) Correct answer by using object_id() method, along with Chris' change
> to genbank.pm.
> 
> 
> # What is the most recent common ancestor of this species and another?
> a) Can't be answered.
> b) Use get_LCA_Node(), but same issues as the lineage question, since
> get_LCA_Node requires a working get_Lineage_Nodes(). It also requires
> correct (unique) ids for all nodes in all lineages to give the
> guaranteed correct answer. But at least you /might/ get the correct
> answer even using only the data in genbank files and no db lookup.
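
To illustrate why the common-ancestor question depends on unique ids, here is a small sketch that compares two lineages by id. The function name lca_id and the ids are made up for illustration and are not real TaxIDs; each lineage is assumed to be ordered from the species up to the root.

```perl
use strict;
use warnings;

# Each lineage is a list of node ids ordered from the species up to the root.
# Returns the first id in the first lineage that also occurs in the second,
# or undef (in scalar context) if the lineages share nothing.
sub lca_id {
    my ($lin_a, $lin_b) = @_;
    my %in_b = map { $_ => 1 } @$lin_b;
    for my $id (@$lin_a) {
        return $id if $in_b{$id};
    }
    return;
}

# Illustrative ids only: 3 and 4 are the ancestors shared by both lineages.
my @lineage_a = (1, 2, 3, 4);
my @lineage_b = (5, 6, 3, 4);

my $lca  = lca_id(\@lineage_a, \@lineage_b);
my $none = lca_id([1, 2], [5, 6]);
```

If the ids were not unique and consistent across lineages, for example fake ids assigned separately for each parsed file, the same comparison could match unrelated nodes or miss a real shared ancestor.
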
> 
> 
> ---- summary:
> 
> It seems like the main problem with Node right now is that it has
> classification() and things like genus(). I propose pure Node method
> solutions to answer the questions classification() and genus() were
> implemented to answer, but in a better, cruft-free way.
> 
> Bio::DB::Taxonomy::genbank anyone?
> 
> Then if you started with a Species/Node generated by a genbank parse,
> and wanted certain questions answered correctly, you only have to set a
> different db_handle(). The Node only stores the static and hopefully
> correct information about itself, whilst all other questions go via
> db_handle, so you can dynamically swap back and forth between databases
> depending on if you need speed or accuracy.