[BioRuby] [GSoC][NeXML and RDF API] Code Review.

Anurag Priyam anurag08priyam at gmail.com
Fri Jun 25 07:34:21 UTC 2010


On Fri, Jun 25, 2010 at 12:19 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> > I have used libxml2 streaming api, without actually streaming the document
> > to the user. The cursor does not move through the document when you iterate
> > over elements( phyloxml does that ). I am parsing the document at one go; at
> > the start, and storing the objects in memory. Should we want to switch to
> > streaming, using libxml's streaming API from start should make it easier.
> >
> > Yes it is libxml2 these days. The site states that it works with ruby 1.8. I
> > am myself working with 1.8.7. I will have to test the compatibility with
> > ruby 1.9.
>
> OK, glad to see that libxml is a standard package these days -
> though it has some horrific error handling. At least it is fast.
>
>
Yes, it is fast, but it has its share of bugs. I have started working on the
ruby-libxml code myself and am helping to maintain it.


> How much time would it cost you to stream the data - and what does it
> mean with regard to changing the API? I guess, in general, NeXML
> files won't be that large, so it may not be that important (Rutger)?
>
> Pj.
>
>
I mean switching the parsing implementation from "parsing at the start" to
streaming, not changing the API. Using the Reader API rather than the DOM API
now would simply make that switch easier. Even if we never switch, the Reader
API is more memory-efficient than the DOM API.
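
To illustrate what "parsing at the start" with a cursor-style reader looks
like, here is a minimal sketch. It uses REXML's pull parser from the Ruby
distribution as a stand-in for libxml's Reader API, so it is not the code in
the NeXML parser itself. The sample document is a simplified, hypothetical
NeXML-like fragment, and parse_at_start is a name made up for this example.

```ruby
require 'rexml/parsers/pullparser'

# Simplified, hypothetical NeXML-like document (not schema-valid NeXML).
SAMPLE = <<~XML
  <nexml>
    <otus id="taxa1">
      <otu id="t1"/>
      <otu id="t2"/>
    </otus>
    <trees id="trees1" otus="taxa1">
      <tree id="tree1"/>
    </trees>
  </nexml>
XML

# Drain the forward-only cursor once, up front, and keep every element
# that carries an id in memory, grouped by tag name.
def parse_at_start(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  index = Hash.new { |hash, tag| hash[tag] = {} }
  while parser.has_next?
    event = parser.pull
    next unless event.start_element?
    name  = event[0] # tag name
    attrs = event[1] # attribute hash
    index[name][attrs['id']] = attrs if attrs['id']
  end
  index
end

index = parse_at_start(SAMPLE)
puts index['otu'].keys.inspect
puts index['trees']['trees1']['otus']
```

The caller only ever sees the in-memory index, so the reader could later be
replaced by a truly streaming implementation without touching the public API.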

Btw, I am not in favour of the switch. With streaming you cannot move
backwards in the document: I cannot fetch a tree by id if the cursor is
already ahead of that tree. Doing both nexml.each_characters and
nexml.each_trees is impossible with pure streaming; I would have to stream
one while caching the other. Otus and otu have a one-to-many relation with
trees, characters, and rows. An API call such as otus.trees, otus.characters,
or otu.sequences would be impossible (not that I have added those calls yet).
Imo, NeXML is non-linear and not meant to be streamed. Besides, other NeXML
implementations also parse the file at the start.
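
The forward-only problem can be sketched in plain Ruby. This is an
illustration under assumptions, not the NeXML API: Tree, stream, and
find_ahead are hypothetical names, and an Enumerator stands in for a
streaming reader's cursor.

```ruby
# Hypothetical record standing in for a NeXML tree element.
Tree = Struct.new(:id, :otus_id)

RECORDS = [
  Tree.new('tree1', 'taxa1'),
  Tree.new('tree2', 'taxa1'),
  Tree.new('tree3', 'taxa2')
].freeze

# Forward-only source: each record is handed out once, in document order.
def stream(records)
  Enumerator.new { |yielder| records.each { |rec| yielder << rec } }
end

# Look for an id using only what is still ahead of the cursor.
def find_ahead(cursor, id)
  loop do
    rec = cursor.next # raises StopIteration at the end; loop rescues it
    return rec if rec.id == id
  end
  nil
end

cursor = stream(RECORDS)
cursor.next # consumes tree1
cursor.next # consumes tree2

# The cursor is already past tree1, so a pure streaming lookup misses it.
missed = find_ahead(cursor, 'tree1')

# Parsing everything up front allows lookup by id and one-to-many queries.
by_id   = RECORDS.to_h { |tree| [tree.id, tree] }
by_otus = RECORDS.group_by(&:otus_id)

puts missed.inspect
puts by_id['tree1'].id
puts by_otus['taxa1'].map(&:id).inspect
```

The by_otus grouping is the kind of cache that a call like otus.trees would
need, which is why it only works once the whole document has been read.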

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


