[Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul

Hilmar Lapp hlapp at gmx.net
Sun Aug 6 19:38:17 UTC 2006


Wow! This is quite a number of changes to digest. Thanks for the  
detailed documentation.

I have three comments.

1) It sounds a bit as if you changed the behavior of get_lca() such
that users may have to adjust their code. If this is true, then this
needs to be made clear in the 1.6 release, as that part will not be
backward compatible. If this is not true, then why did you have to
change the implementation of Bio::Tools::Phylo::PAML to make tests
pass? I.e., to what extent can what broke Bio::Tools::Phylo::PAML
also break someone's script?

2) I can't find object_id() on Tree::Node or Taxonomy::Taxon. Where
is/was it? The reason I am asking is that this method is part of the
Bio::IdentifiableI API, and therefore if you want to deprecate it you
are suggesting to deprecate implementing Bio::IdentifiableI, and the
rest of those methods would need to be deprecated along with it.

3) Your whole email should probably go on the wiki, linked somewhere
under documentation or release notes. Or does somebody have a better
idea?

	-hilmar

On Aug 5, 2006, at 12:42 PM, Sendu Bala wrote:

> After the initial round of changes to Taxonomy described at
> http://bugzilla.open-bio.org/show_bug.cgi?id=2047 (now committed),
> further changes will allow for the transition of Bio::Species to
> Bio::Taxonomy::Node (renamed to Bio::Taxon), and for Taxon to be fully
> usable without external database access.
>
> In brief: rename Bio::Taxonomy::Node to Bio::Taxon, make Bio::Taxon
> implement Bio::Tree::NodeI, make Bio::Species a Bio::Taxon, remove all
> Bio::Species-related-backward-compatible methods from Bio::Taxon,
> create Bio::DB::Taxonomy::list, update Bio::SeqIO::genbank et al.
>
> The following is the set of changes that have been made (with all
> relevant tests passing), but not committed. Feedback is encouraged.
> These notes are also available at
> http://bugzilla.open-bio.org/show_bug.cgi?id=2061 for easier reference
> later.
>
>
> (in the following notes, use of the name-case word 'Taxon' refers to
> the module Bio::Taxon or instance of that class, while 'taxon' refers
> to the concept of a taxonomic unit)
>
>
> Bio::DB::Taxonomy, ::*
> ----------------------
>
> # API-CHANGES
> get_Taxonomy_Node() renamed get_taxon(). get_Taxonomy_Node() is a
> synonym of get_taxon(), eventually to be deprecated.
>
> New methods ancestor() and each_Descendent() correspond to similar
> methods in Bio::Taxon and Bio::Tree::NodeI, removing the need to
> store parent_id on each Taxon.
>
> New internal method _handle_internal_id(). See Implementation notes
> below.
>
> # Implementation changes
> Normally when you create a Bio::Taxon it automatically receives a new
> unique internal id. However, when you request the same Taxon from a
> database more than once you always get an object with the same
> internal id. This allows get_lca to work, and lets you modify one copy
> of a returned object but still compare it to another copy and see that
> they are supposed to be the same taxon. This even applies across
> different databases. The Taxon objects returned will still have
> different memory locations.
>
>
> Bio::DB::Taxonomy::flatfile
> ---------------------------
>
> # API-CHANGES
> get_Children_Taxids is deprecated - method no longer part of the
> DB::Taxonomy interface, and superseded by each_Descendent (which is
> actually implemented by all databases).
>
> # Implementation changes
> No longer includes the fake root node 'root'; there are multiple roots
> now (10239, 12884, 12908, 29384 and 131567). This means when getting
> the lineage you no longer have to remove the root node. This is now
> consistent with the results possible with entrez.
> NB: You have to delete your current indexes before you will notice the
> change.
>
>
> Bio::DB::Taxonomy::entrez
> -------------------------
>
> # API-CHANGES
> get_node has new option -full that tells it to retrieve full details
> on a taxon from the website. (Otherwise, it may return a Taxon with
> minimal information if only minimal information had previously been
> cached.)
>
> # Implementation changes
> Caches the data it gets from the website and tries to minimise the
> number of website accesses it does.
>
>
> Bio::DB::Taxonomy::list
> -----------------------
>
> # NEW
> An implementation of Bio::DB::Taxonomy that accepts lists of words to
> build a database. Used especially by Bio::Species for backward
> compatibility purposes, but also useful generally to quickly and
> easily create a lineage of Bio::Taxon objects / a Tree.
>
>
> Bio::Tree::TreeI
> ----------------
>
> # BUG-FIXES
> number_nodes() returned the number of descendants belonging to the
> root node, but forgot to count the root node itself. Now
> number_nodes() == scalar(get_nodes()).
>
>
> Bio::Tree::Tree
> ---------------
>
> # API-CHANGES
> Added -node option to new() which will call get_lineage_nodes() on the
> supplied NodeI and set the tree root that way. This is so you can
> easily make a tree from a Bio::Taxon. So that the Tree resulting from
> a Bio::Taxon with a db_handle doesn't end up pulling in the entire
> database, ancestor() / add_Descendent() is set for each member of the
> lineage in the process of finding the root from the -node. This means
> the database will no longer be asked what the ancestors or descendants
> of the taxa are.
>
>
> Bio::Tree::TreeFunctionsI
> -------------------------
>
> # API-CHANGES
> New method get_lineage_nodes(). Returns all the ancestors of a
> particular node, up to the tree's defined root node.
>
> get_lca() can now also accept just a list of nodes, and also more than
> 2 nodes.
>
> Removed _check_two_nodes() since no longer necessary.
>
> New method splice(). Removes requested nodes from a tree, making the
> removed node's ancestor the new ancestor of the removed node's
> descendants (ie. remove nodes without making the tree fall apart).
>
> New method contract_linear_paths(). Splices out all nodes in the tree
> that have an ancestor and only one descendant.
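
The splice() and contract_linear_paths() behaviour described above can be sketched as follows. This is an illustrative Python model under stated assumptions, not the Bio::Tree::TreeFunctionsI code: the Node class, the newick() helper and the function signatures are hypothetical, and splice() here assumes the node being removed is not the root.

```python
class Node:
    def __init__(self, name, ancestor=None):
        self.name = name
        self.ancestor = None
        self.descendants = []
        if ancestor is not None:
            ancestor.add(self)

    def add(self, child):
        child.ancestor = self
        self.descendants.append(child)


def splice(node):
    """Remove a non-root node, reattaching its descendants to its ancestor."""
    parent = node.ancestor
    parent.descendants.remove(node)
    for child in list(node.descendants):
        parent.add(child)
    node.ancestor, node.descendants = None, []


def all_nodes(root):
    yield root
    for child in root.descendants:
        yield from all_nodes(child)


def contract_linear_paths(root):
    """Splice out every node with an ancestor and exactly one descendant."""
    linear = [n for n in all_nodes(root)
              if n.ancestor is not None and len(n.descendants) == 1]
    for node in linear:
        splice(node)


def newick(node):
    if not node.descendants:
        return node.name
    inner = ",".join(newick(c) for c in node.descendants)
    return "(" + inner + ")" + node.name


# root -> A -> B -> (C, D), plus root -> E; A is the only linear node.
root = Node("root")
a = Node("A", root)
b = Node("B", a)
Node("C", b)
Node("D", b)
Node("E", root)
before = newick(root)
contract_linear_paths(root)
after = newick(root)
```

Collecting the linear nodes before splicing keeps the traversal safe, since splicing a node never changes how many descendants its neighbours have.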
>
> New method merge_lineage(). Merges a lineage of nodes with an existing
> Tree.
>
> # Implementation changes
> get_lca() now uses get_lineage_nodes() and gives the correct answer;
> the previous implementation was not guaranteed to. Can get the lca of
> more than 2 nodes.
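
A lineage-based lowest common ancestor for any number of nodes can be sketched like this. It is a minimal Python illustration of the idea, not the BioPerl code: the parent_of dictionary, the function signatures and the choice to let a node count as its own ancestor candidate are assumptions. The taxon names are borrowed from the taxonomy2tree example later in this message.

```python
def lineage(node, parent_of):
    """Ancestors of node, ordered from the root down to node's parent."""
    path = []
    while node in parent_of:
        node = parent_of[node]
        path.append(node)
    return path[::-1]


def get_lca(nodes, parent_of):
    """Deepest node shared by the root-to-node paths of all given nodes."""
    paths = [lineage(n, parent_of) + [n] for n in nodes]
    lca = None
    for level in zip(*paths):
        # Stop at the first depth where the paths disagree.
        if any(step != level[0] for step in level):
            break
        lca = level[0]
    return lca


group = "Homo/Pan/Gorilla group"
parent_of = {
    "Hominidae": "cellular organisms",
    "Pongo pygmaeus": "Hominidae",
    group: "Hominidae",
    "Gorilla": group,
    "Pan troglodytes": group,
    "Homo sapiens": group,
}

lca_three = get_lca(["Gorilla", "Pan troglodytes", "Homo sapiens"], parent_of)
lca_two = get_lca(["Homo sapiens", "Pongo pygmaeus"], parent_of)
```

Because every path starts at the root, walking the paths in parallel and stopping at the first disagreement works the same way for two nodes or twenty.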
>
> reroot() uses get_lineage_nodes().
>
> Methods distance(), is_monophyletic() and is_paraphyletic()
> reimplemented with the new get_lca().
>
> find_node() no longer warns about an unknown search type (allowing you
> to search on -rank and any other thing in the future).
>
>
> Bio::Tools::Phylo::PAML
> -----------------------
>
> # Implementation changes
> Methods that make use of get_lca() reimplemented with the new
> get_lca(). (Otherwise, PAML tests no longer passed.)
>
>
> Bio::Tree::Node
> ---------------
>
> # Implementation changes
> ancestor() now correctly removes the node from the descendants of its
> previous ancestor and adds it to those of the new ancestor when the
> ancestor is changed.
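
The bookkeeping described above amounts to keeping both directions of the parent/child link in sync whenever the ancestor changes. A minimal Python sketch, assuming a hypothetical Node class with an ancestor property (this is not the Bio::Tree::Node source):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self._ancestor = None
        self.descendants = []

    @property
    def ancestor(self):
        return self._ancestor

    @ancestor.setter
    def ancestor(self, new_ancestor):
        if new_ancestor is self._ancestor:
            return  # nothing to update
        if self._ancestor is not None:
            # Detach from the previous ancestor's descendant list.
            self._ancestor.descendants.remove(self)
        if new_ancestor is not None:
            # Attach to the new ancestor's descendant list.
            new_ancestor.descendants.append(self)
        self._ancestor = new_ancestor


old, new, child = Node("old"), Node("new"), Node("child")
child.ancestor = old
child.ancestor = new
```

Without the detach step, the old ancestor would still list the child as a descendant after the move.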
>
>
> t/Node.t
> --------
> Added tests for setting ancestor()
>
>
> Bio::Taxonomy::Node
> -------------------
>
> # DEPRECATED (name change)
> isa Bio::Taxon
>
> # Implementation changes
> No code; delegates to Bio::Taxon
>
>
> Bio::Taxon
> ----------
>
> # NEW (name change from Bio::Taxonomy::Node)
> Changes below relate to changes to Bio::Taxonomy::Node
>
> # API-CHANGES
> Removed the following options from new(): -classification,
> -sub_species, -variant and -organelle. The corresponding methods are
> no longer present.
>
> New option to new(): -id. For Tree::Node compatibility. -object_id and
> -ncbi_taxid are no longer mentioned in docs but still work.
>
> The -dbh option to new() no longer defaults to any database. A
> Bio::Taxon is now fully usable without ever setting a database handle.
>
> Removed the methods binomial(), species(), genus(), sub_species(),
> variant(), classification() and show_all(). Not appropriate to have
> rank-specific methods in a class that models any single rank.
> Definitely not appropriate to store information about other taxa in a
> Taxon. These questions can be answered using Tree* methods, or with
> Bio::Species.
>
> Removed method organelle(). Organelle isn't part of a taxonomy. Other
> modules like SeqIO should have their own storage of organelle
> information as necessary (but Bio::Species retains organelle() in the
> meantime).
>
> Removed methods get_Lineage_Nodes() and get_LCA_Node(). For these
> kinds of methods you should now use Bio::Tree::TreeFunctionsI methods.
>
> You can no longer set parent_id(). The id of your parent is determined
> by the Taxon that is your ancestor. This method is no longer needed
> (previously it was central to the workings of the object), so is now
> deprecated. It issues a warning if you try and set its value.
>
> get_Parent_Node(), eventually to be deprecated, is now a synonym of
> new method ancestor(). (For Tree::Node compatibility.)
>
> get_Children_Nodes(), eventually to be deprecated, is now a synonym of
> new method each_Descendent(). (For Tree::Node compatibility.)
>
> object_id(), eventually to be deprecated, is now a synonym of new
> method id(). (For Tree::Node compatibility.)
>
> # Implementation changes
> is(also)a Bio::Tree::Node.
>
> division() was implemented via $self->name('division', @_). Now
> name('division') will only allow one value to be set, and division()
> only ever returns a single scalar or undef, never an array.
>
> common_names() returns the last common_name in scalar context (instead
> of first), so set/get/set/get works as expected with common_name().
>
> db_handle() similar to before when getting, but now setting the handle
> will locate $self in the new database (by id or name) and merge data
> (eg. if rank was 'no rank' and new database node has rank 'species',
> $self->rank() will become 'species').
>
> get_Parent_Node() (ne ancestor()) and get_Children_Nodes() (ne
> each_Descendent()) now use the Bio::Tree::Node implementation.
> ancestor() falls back to asking the database for the ancestor if one
> had not been manually set by the user. each_Descendent() does NOT fall
> back to the database, preventing the whole database being pulled into
> a Tree object made with a Bio::Taxon.
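
The asymmetry described above (ancestor lookups may hit the database, descendant lookups never do) can be modelled as below. This is a Python sketch under stated assumptions, not Bio::Taxon itself: ToyDB, its lookups counter, set_ancestor() and the caching of the fetched ancestor are all hypothetical choices made for the illustration.

```python
class ToyDB:
    """Stand-in database mapping each taxon name to its parent's name."""

    def __init__(self, parent_of):
        self.parent_of = parent_of
        self.lookups = 0

    def ancestor_name(self, name):
        self.lookups += 1
        return self.parent_of.get(name)


class Taxon:
    def __init__(self, name, db=None):
        self.name, self.db = name, db
        self._ancestor = None
        self._descendants = []

    def set_ancestor(self, taxon):
        self._ancestor = taxon
        taxon._descendants.append(self)

    def ancestor(self):
        # Fall back to the database only if no ancestor was set by hand.
        if self._ancestor is None and self.db is not None:
            parent_name = self.db.ancestor_name(self.name)
            if parent_name is not None:
                self.set_ancestor(Taxon(parent_name, self.db))
        return self._ancestor

    def each_descendant(self):
        # Deliberately no database fallback: only attached nodes are returned.
        return list(self._descendants)


db = ToyDB({
    "Homo sapiens": "Homo",
    "Homo neanderthalensis": "Homo",
    "Homo": "Hominidae",
})
human = Taxon("Homo sapiens", db)
genus = human.ancestor()
children = [t.name for t in genus.each_descendant()]
```

Here the genus only knows about the one species that was explicitly linked to it, even though the toy database holds a second one.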
>
> parent_id() now gets the ancestor Taxon with ancestor() and returns
> $ancestor->id().
>
> Had to remove the clean up methods from Bio::Tree::Node since they
> were in a CODE ref, preventing Bio::Species objects from being frozen
> with Storable. Will come up with a better solution in the future.
>
>
> Bio::Taxonomy
> -------------
>
> # DEPRECATED
> Redundant
>
>
> Bio::Taxonomy::Taxon
> --------------------
>
> # DEPRECATED
> Redundant
>
>
> Bio::Taxonomy::Tree
> -------------------
>
> # DEPRECATED
> Redundant
>
>
> Bio::Taxonomy::FactoryI
> -----------------------
>
> # DEPRECATED
> Redundant
>
>
> Bio::Species
> ------------
>
> # Implementation changes
> Bio::Species isa Bio::Taxon.
>
> No method uses validate_species_name() any more (but the method
> remains unaltered, as does validate_name() which just returns 1 - no
> change).
>
> classification() set implemented as:
> Set db_handle() to a new Bio::DB::Taxonomy::list with the supplied
> classification array and make a Bio::Tree::Tree of self, stored in
> self.
> Getting the classification implemented as:
> Return the scientific_name() of each Taxon returned by our
> tree->get_lineage_nodes.
>
> Methods ncbi_taxid(), division() and common_name() implemented by
> Taxon.
>
> Methods species(), genus(), subspecies() and variant() no longer
> get/set elements in the classification array or store direct values.
> They are implemented like:
> Ask our tree for the taxon with rank() eq method name and set/get
> the scientific_name of that.
> Otherwise, for methods species() and genus() assume we are rank()
> 'species', our parent taxon is rank() 'genus' and try again. For
> subspecies() and variant(), fall back to old implementation (store
> data directly on self).
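
The getter side of the rank lookup described above can be sketched as follows. This is a simplified Python illustration, not the Bio::Species code: the name_for_rank() function and the representation of a lineage as (rank, scientific_name) pairs ordered from root to self are assumptions, and the stored-on-self fallback for subspecies() and variant() is left out.

```python
def name_for_rank(lineage, rank):
    """Return the scientific name for rank, given (rank, name) pairs
    ordered from the root down to the taxon itself."""
    # First choice: a taxon in the lineage that really has this rank.
    for node_rank, name in lineage:
        if node_rank == rank:
            return name
    # Fallback: assume self is the species and its parent is the genus.
    assumed_position = {"species": -1, "genus": -2}
    if rank in assumed_position and len(lineage) >= -assumed_position[rank]:
        return lineage[assumed_position[rank]][1]
    return None


ranked = [
    ("no rank", "Eukaryota"),
    ("genus", "Homo"),
    ("species", "Homo sapiens"),
]
unranked = [
    ("no rank", "Mus"),
    ("no rank", "Mus musculus"),
]
```

The fallback matters for lineages built from a plain list of names, where no rank information is available.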
>
> binomial() prefers to simply return scientific_name() if we are a
> Taxon with rank() 'species' and the scientific_name is at least a 2
> word scalar. It interprets the 'FULL' option as wanting the trinomial
> name and prefers to simply return scientific_name() if we have rank()
> 'subspecies' or 'variant' and at least a 3 word scalar. Failing these
> two cases, it falls back on the old implementation (build 'genus
> species' from the classification), but with a little more intelligence
> to try and not duplicate names.
>
> # Behaviour changes
> An indirect new behaviour is that the SeqIO modules will probably
> return ->species() as the real species name (eg. 'Homo sapiens'), not
> the previously (and sometimes incorrectly) munged name (eg.
> 'sapiens').
>
> # Notes
> Stores a Bio::Tree::Tree on itself, had to remove its clean up methods
> since they were in a CODE ref, preventing us from being frozen with
> Storable. Will come up with a better solution in the future.
>
>
> Bio::SeqIO::*
> -------------
> A number of these modules make use of Bio::Species when parsing
> taxonomic information. They probably all have/had problems. I've only
> investigated genbank to any significant depth; the others need to be
> properly tested to see whether, when they read taxonomic data in, they
> can output it again identically to the input file. It is probably the
> case that some fail at this currently. (I simply don't have time
> myself to make all these modules perfect.)
>
>
> Bio::SeqIO::bsml_sax
> --------------------
>
> # BUG-FIXES
> It used to include the genus twice in the classification array of
> the Bio::Species object. Now it doesn't.
>
>
> Bio::SeqIO::embl
> ----------------
>
> # BUG-FIXES
> When the OC lines include the species name, the Bio::Species
> classification array included the true species name as a rank above
> genus and the real genus duplicated as a rank above that. Now it
> doesn't.
>
>
> Bio::SeqIO::genbank
> -------------------
>
> # BUG-FIXES
> Now that Bio::Species isa Bio::Taxon, it is possible to ensure that
> the output matches the input (in the SOURCE and ORGANISM lines at
> least). Usage of Bio::Species re-implemented to get all tests in
> t/genbank.t to pass.
>
>
> t/genbank.t
> -----------
> Modified some tests to expect the correct answer, ie.
> $bio_species_obj->species now expects 'Mus musculus', not 'musculus'.
>
>
> t/Index.t
> ---------
> Modified some tests to expect the correct answer, ie.
> $bio_species_obj->species now expects 'Homo sapiens', not 'sapiens'.
>
>
> scripts/taxa/taxonomy2tree.PLS
> ------------------------------
> Added some extra options to define the location of the database
> indexes and files, or use the entrez on-line database instead. (Note
> how entrez and flatfile are now truly interchangeable.)
>
> Reimplemented using the new Bio::Taxon system. Now much simpler. You
> also get the correct answer, eg. instead of
> (("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo sapiens")"Homo/Pan/Gorilla group")Hominidae)root;
> you now get
> (("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo sapiens")"Homo/Pan/Gorilla group")Hominidae)"cellular organisms";
>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================







