[BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support

Mon May 25 22:16:08 UTC 2009

Hi, Diana:

What you wrote looks more or less OK.
I agree it is better to extend existing classes than to change
them drastically.
One thing to keep in mind is that many attributes are themselves composed
of multiple fields, i.e. you would need to create a class for
them (if such a class does not already exist).
Besides sequence, the most important element is taxonomy.
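
To illustrate the point about composite attributes, here is a minimal
sketch. All names in it (BaseNode, PhyloXMLNode, Confidence) are
placeholders, not existing BioRuby code; BaseNode only stands in for
Bio::Tree::Node so that the snippet runs without BioRuby installed, and
the type/value layout of Confidence is my assumption of how a phyloXML
confidence would be modelled.

```ruby
# Stand-in for Bio::Tree::Node so this sketch runs without BioRuby.
class BaseNode
  attr_accessor :name, :bootstrap

  def initialize(name = nil)
    @name = name
  end
end

# A composite attribute: assumed here to carry a type and a value,
# so it gets its own small class instead of being a bare number.
Confidence = Struct.new(:type, :value)

# Hypothetical subclass that adds a few phyloXML-specific attributes.
class PhyloXMLNode < BaseNode
  attr_accessor :id_source, :confidences, :taxonomies, :sequences

  def initialize(name = nil)
    super
    # Multi-valued elements start out as empty arrays.
    @confidences = []
    @taxonomies  = []
    @sequences   = []
  end
end

node = PhyloXMLNode.new("A")
node.confidences << Confidence.new("bootstrap", 89.0)
```

Because PhyloXMLNode is still a BaseNode, code written against the
parent class keeps working on the extended nodes.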

Since BioRuby does not contain a general-purpose taxonomy class at this
point, it might be worth spending some time designing such a class.

I propose a taxonomy class with the following elements:
-scientific name (e.g. Nematostella vectensis)
-common name (e.g. starlet sea anemone)
-code (or mnemonic, as used by Swiss-Prot) (e.g. NEMVE)
-rank (e.g. species)

phyloXML also has a URI for taxonomies, but I am not sure whether this is
important for a general taxonomy class.

On the other hand, a general taxonomy class might also have
- authority (e.g. Stephenson, 1935)
- aliases []

(if these elements are considered important, they could of course be
added to the next version of phyloXML)
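
As a rough sketch of the proposal, assuming plain string fields and
keyword arguments (the class name Taxonomy and its interface are
hypothetical and not part of BioRuby; the URI is left out since its
usefulness is undecided):

```ruby
# Hypothetical general-purpose taxonomy class; not part of BioRuby.
class Taxonomy
  attr_accessor :scientific_name, :common_name, :code, :rank,
                :authority, :aliases

  def initialize(scientific_name: nil, common_name: nil, code: nil,
                 rank: nil, authority: nil, aliases: [])
    @scientific_name = scientific_name
    @common_name     = common_name
    @code            = code
    @rank            = rank
    @authority       = authority
    # Copy the array so instances never share the default value.
    @aliases         = aliases.dup
  end

  # Prefer the scientific name, then fall back to the other names.
  def to_s
    scientific_name || common_name || code || ""
  end
end

tax = Taxonomy.new(scientific_name: "Nematostella vectensis",
                   common_name: "starlet sea anemone",
                   code: "NEMVE",
                   rank: "species",
                   authority: "Stephenson, 1935")
```

All fields are optional here, so a parser could fill in only what a
given file provides.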

What do people think about this?

Christian

Diana Jaunzeikare wrote:
> Hi all,
>
> Since there are many more elements in PhyloXML than in Bio::Tree, I
> propose to make a class PhyloXMLNode which inherits from Bio::Tree::Node.
>
> PhyloXMLNode:
> # attributes from Bio::Tree::Node
> * bootstrap
> * bootstrap_string
> * ec_number
> * name
> * scientific_name
> * taxonomy_id
>
> #new attributes
> * id_source
> * confidence [] ([] means array of elements)
> * color
> * node_id
> * taxonomy []
> * sequence [] (Bio::Sequence object)
> * events
> * binary_characters
> * distribution []
> * date
> * reference []
> * property []
>
> Also, since <phylogeny> element does not only consist of <clade> 
> elements, but other elements also, Bio::Tree class should be extended.
>
> PhyloXMLTree
> #inherited from Bio::Tree
> * options
> * root
>
> # new attributes
> * rooted (boolean)
> * rerootable (boolean)
> * branch_length_unit
> * type
> * name
> * id
> * description
> * date
> * confidence []
> * clade_relation []
> * sequence_relation []
> * property []
>
>
> I think inheritance is better than creating a separate class, because
> then users will be able to use Bio::Tree as before, while also being
> able to read PhyloXML data files. Conversion from PhyloXML
> to other formats will then also be easy, since the Bio::Tree class has
> output_newick, output_nhx, and output_phylip_distance_matrix methods.
>
> Diana
>
> Project Page:
> https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby
>
> On Thu, May 21, 2009 at 9:03 AM, Chris Fields <cjfields at illinois.edu> wrote:
>
>     Actually, as Perl's XML::LibXML::Reader is described it almost sounds
>     perfect, though I'm unsure of backtracking to a specific node in the
>     tree (and thus post/pre-order of nodes). Saying that, I would be
>     surprised if it weren't possible, though.
>
>     chris
>
>     On May 20, 2009, at 11:02 PM, Christian M Zmasek wrote:
>
>     > Hi:
>     >
>     > Thanks for the detailed replies by Hilmar and Chris!
>     > I think it is a very good idea to keep such very large trees in
>     > mind, and possibly implement a solution which only loads requested
>     > nodes into memory (as described by Hilmar and Chris) if there is
>     > enough time left at the end of the project.
>     >
>     > Re "It's tricky with re: to a number of aspects, but it can be
>     > done. For instance, if one wanted to modify the created nodes
>     > (i.e. if the nodes are mutable), or creating a generic Lazy set of
>     > classes capable of dealing with multiple formats."
>     >
>     > How would you do post-order or pre-order iteration of nodes?
>     > Wouldn't you have to go back and forth in the file?
>     >
>     > CZ
>     >
>     > Chris Fields wrote:
>     >> On May 20, 2009, at 8:22 AM, Hilmar Lapp wrote:
>     >>
>     >>
>     >>> On May 19, 2009, at 5:54 PM, Christian M Zmasek wrote:
>     >>>
>     >>>
>     >>>> I think it is perfectly acceptable to expect to have enough
>     >>>> memory to keep at least one tree in memory
>     >>>>
>     >>> Sounds like a good and perfectly reasonable starting point to me
>     >>> too. It's also the way other toolkits (such as BioPerl) work.
>     >>>
>     >>> Having said that, I don't find it inconceivable that we may be
>     >>> working with trees in the near future that don't fit into memory
>     >>> for a 1GB RAM machine if they are richly decorated (which is
>     >>> something that phyloXML wants to enable, isn't it?). Solving that
>     >>> to me though seems to be a question of writing an appropriate Tree
>     >>> implementation that happens to store most of the data on disk
>     >>> rather than in memory, and not an issue for how to write a parser.
>     >>> Ideally though, the parser uses a factory for creating the (tree
>     >>> and/or node) objects, so that later it can be made to use an
>     >>> on-disk Tree implementation simply by passing it another factory.
>     >>> I.e., ideally the parser would not assume and hard-code the Tree
>     >>> implementation class.
>     >>>
>     >>> Just my $0.02.
>     >>>
>     >>>     -hilmar
>     >>>
>     >>
>     >> This could be implemented in a lazy way or using lightweight
>     >> objects. The Tree object itself contains the XML parser or a
>     >> reference thereof (probably LibXML Reader-based) and creates the
>     >> relevant nodes as needed. The only thing needed would be some
>     >> light parsing to indicate start-end file points.
>     >>
>     >> It's tricky with re: to a number of aspects, but it can be done.
>     >> For instance, if one wanted to modify the created nodes (i.e. if
>     >> the nodes are mutable), or creating a generic Lazy set of classes
>     >> capable of dealing with multiple formats.
>     >>
>     >> Just in case anyone's wondering, I have been thinking along these
>     >> lines for a while re: BioPerl, Bio::Seq, and very large files... ;>
>     >>
>     >> chris
>     >>
>     >
>