[BioRuby] Update on phyloXML support for BioRuby project

Diana Jaunzeikare rozziite at gmail.com
Wed May 20 14:51:26 UTC 2009


On Wed, May 20, 2009 at 2:09 AM, Naohisa GOTO
<ngoto at gen-info.osaka-u.ac.jp> wrote:

> Hi all,
>
> On Tue, 19 May 2009 17:07:59 -0400
> Diana Jaunzeikare <rozziite at gmail.com> wrote:
>
> > So, I think we have reached consensus that the best choice is the
> > libxml2-ruby SAX-based XML parser.
>
> In libxml2-ruby, I think LibXML::XML::Reader is the best choice,
> because it is more memory efficient than DOM and its API is simpler
> than that of SAX. LibXML::XML::SAXParser is not bad, but I wonder
> if SAX's callback-based API would make our code too complex and
> difficult to maintain.
>

I wrote sample code using both LibXML::XML::Reader and
LibXML::XML::SAXParser, and I agree that SAX's callback-based API could
become very complex and hard to maintain.
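
To make the difference concrete, here is a minimal sketch that uses no XML
library at all. The event array, NameHandler, push_parse and pull_parse are
all made up for illustration and are not part of the libxml2-ruby API; the
point is only where the state ends up in each style.

```ruby
# Hypothetical, hand-made event stream standing in for a parser's output.
# Each event is [type, name_or_text]; this is NOT the libxml2-ruby API.
EVENTS = [
  [:start, "clade"], [:start, "name"], [:text, "A"], [:end, "name"],
  [:start, "clade"], [:start, "name"], [:text, "B"], [:end, "name"],
  [:end, "clade"], [:end, "clade"]
].freeze

# Push (SAX-like) style: the parser drives, so state that spans several
# callbacks has to live in instance variables of a handler object.
class NameHandler
  attr_reader :names

  def initialize
    @names = []
    @in_name = false
  end

  def on_start(name)
    @in_name = true if name == "name"
  end

  def on_text(text)
    @names << text if @in_name
  end

  def on_end(name)
    @in_name = false if name == "name"
  end
end

def push_parse(events, handler)
  events.each do |type, value|
    case type
    when :start then handler.on_start(value)
    when :text  then handler.on_text(value)
    when :end   then handler.on_end(value)
    end
  end
  handler.names
end

# Pull (Reader-like) style: the caller drives and asks for the next event,
# so the same logic reads top to bottom in one method.
def pull_parse(events)
  cursor = events.each # external enumerator; cursor.next pulls one event
  names = []
  loop do # Kernel#loop ends quietly when cursor.next raises StopIteration
    type, value = cursor.next
    next unless type == :start && value == "name"

    _text_type, text = cursor.next # the text event that follows <name>
    names << text
  end
  names
end

p push_parse(EVENTS, NameHandler.new)
p pull_parse(EVENTS)
```

Both calls collect the same names. In the push version the "am I inside a
name element" flag must survive between callbacks, which is the part that
tends to grow as more elements are handled.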


>
> > Since BioRuby has a Tree class (http://bioruby.org/rdoc/), it seems
> > logical that the parser should return a Tree class object. By using
> > a SAX parser we avoid the problem of having the whole XML file in memory,
>
> I think so.
> An alternative is to return an object of a wrapper class that mimics
> Bio::Tree's API. However, such a class may be too hard to implement,
> and data type conversion to and from Bio::Tree would still be needed
> even in this case. So I think returning a Bio::Tree object is good.
>
> > I am a little confused about the require statements in BioRuby classes.
> > It looks like bio/tree.rb should hold a general class, but it requires
> > bio/db/newick.rb, but this file in turn requires bio/tree.rb.
>
> The only reason bio/tree.rb requires bio/db/newick.rb is
> for the Newick and NHX output of the tree. The code will
> be refactored in the future.
>
> On Tue, 19 May 2009 14:54:18 -0700
> Christian M Zmasek <czmasek at burnham.org> wrote:
>
> > Hi, Diana:
> >
> > I think it is a good idea to have the parser return one tree at a time,
> > as opposed to returning a list of trees.
>
> I think so.
>
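
A rough sketch of what "one tree at a time" could look like from the
caller's side. The line-based scanner below is a toy and an assumption made
only for illustration (real phyloXML needs a real XML parser), and
each_tree is a hypothetical name, not an existing BioRuby method.

```ruby
# Toy input: one token per line. Real phyloXML would need a real XML parser;
# this line-based scanner is only an assumption made for illustration.
SAMPLE = <<~XML
  <phyloxml>
  <phylogeny>
  <clade>A</clade>
  </phylogeny>
  <phylogeny>
  <clade>B</clade>
  <clade>C</clade>
  </phylogeny>
  </phyloxml>
XML

# Yields the lines of one <phylogeny> block at a time; returns an Enumerator
# when called without a block, so callers can pull trees one by one.
def each_tree(lines)
  return enum_for(:each_tree, lines) unless block_given?

  current = nil
  lines.each do |raw|
    line = raw.strip
    case line
    when "<phylogeny>"
      current = []
    when "</phylogeny>"
      yield current
      current = nil
    else
      current << line if current
    end
  end
end

trees = each_tree(SAMPLE.each_line)
first = trees.next  # input has been consumed only up to the first </phylogeny>
second = trees.next
p first
p second
```

Because the caller pulls each tree explicitly, only the tree currently being
built needs to be held in memory, not the whole list.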
> > On the other hand, the same does not apply to nodes. I think it is
> > perfectly acceptable to expect to have enough memory to keep at least
> > one tree in memory (a good target size might be a binary tree with
> > ten-thousand external nodes and 200 bytes of annotation per node, which
> > according to my rough calculations would require less than 5MB).
> >
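
The estimate above can be reproduced with a few lines of arithmetic. This
sketch assumes a rooted, strictly binary tree and counts only the 200 bytes
of annotation per node; per-object overhead of the Ruby interpreter is not
included, so the real footprint would be somewhat larger. The method name is
hypothetical.

```ruby
# Rough memory estimate for one tree held fully in memory.
# A rooted, strictly binary tree with n external nodes has n - 1 internal
# nodes, so 2n - 1 nodes in total.
def estimated_tree_bytes(external_nodes, bytes_per_node)
  total_nodes = 2 * external_nodes - 1
  total_nodes * bytes_per_node
end

bytes = estimated_tree_bytes(10_000, 200)
puts bytes                             # 3999800
puts((bytes / 1_000_000.0).round(2))   # 4.0 (MB, annotation only)
```

That is 19,999 nodes at 200 bytes each, or about 4 MB, which is consistent
with the "less than 5MB" figure.
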
> > For your tree use cases, important ones to add are:
> > * iteration over all nodes
> > * retrieval/finding of specific nodes according to some criterion (e.g.
> > find all nodes for which the species is "E. coli")
> > * tree reconciliation (e.g. compare a gene tree to a species tree, in
> > order to determine duplications on the gene tree)
> >
> > In any case, all these applications/algorithms will be most time
> > efficient and easiest to implement with trees which are completely in
> > memory.
>
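
With the whole tree in memory, the first two use cases become short. The
sketch below uses a hypothetical minimal Node class written only for this
example; it is not Bio::Tree's API, and the node names and species are
invented sample data.

```ruby
# Hypothetical minimal node type; this is NOT Bio::Tree's API.
class Node
  include Enumerable

  attr_reader :name, :species, :children

  def initialize(name, species: nil, children: [])
    @name = name
    @species = species
    @children = children
  end

  # Depth-first, pre-order traversal over this node and all its descendants.
  def each(&block)
    return enum_for(:each) unless block

    block.call(self)
    children.each { |child| child.each(&block) }
    self
  end
end

tree = Node.new("root", children: [
  Node.new("n1", children: [
    Node.new("geneA", species: "E. coli"),
    Node.new("geneB", species: "B. subtilis")
  ]),
  Node.new("geneC", species: "E. coli")
])

# Iteration over all nodes.
all_names = tree.map(&:name)

# Finding nodes by a criterion, here the species annotation.
ecoli = tree.select { |node| node.species == "E. coli" }.map(&:name)

p all_names
p ecoli
```

Defining each and including Enumerable gives map, select, find and the rest
for free, so most "find nodes where ..." queries need no extra code.
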
> In addition, it is easy to implement manipulation of trees
> (adding/deleting nodes and edges, etc.).
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>


