[BioRuby] [GSoC][NeXML and RDF API] Code Review.

Thu Jun 24 13:54:11 UTC 2010

Hi Everyone,

I am going to review Anurag's code. Naohisa, and perhaps others, will
join in. 

A quick recap: Anurag is working on implementing an NeXML parser with
RDF support (for the semantic web).

NeXML is an XMLized and improved version of Nexus, and is used for
interchanging sequences, alignments and trees between
programs/services (correct me if I am wrong). A full descripion of
NeXML can be found at

  https://www.nescent.org/wg_evoinfo/Future_Data_Exchange_Standard

NeXML is an important standard, and very good to have in BioRuby.

Anurag: thanks for the good work, so far. I can see you have put a lot
of work in. And, I like your style. I can see you are a competent
programmer, so you can expect the worst criticism ;) I am going to
start with some high level questions.

Can someone who has worked with NeXML (Rutger) have a look at the
interface description on:

  https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

It looks natural to query this way, if you know what the NeXML files
contains (e.g. trees, or sequences). What would be the natural
approach if you do *not* know the contents? I.e. how does one iterate
over the NeXML object?

Anurag, your web page states you implemented a LibXML::Parser, and you
named it Parser. Meanwhile, it looks like you have implemented libxml2
streaming, using a Reader. This is a bit confusing. I presume you are
using the technique used in Diana's PhyloXML parser. You are requiring
the 'xml' package. Is that libxml2 these days, or is it actually
'libxml'? Does it work for all Ruby versions? libxml is an external
(binary) dependency, so it may not exist and fail.  PhyloXML does not
handle failure either.

The other high-level questions concern testing. For others, the unit
tests are here:

  http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/

I notice you have limited test input data. How can you be really sure
your code works for all cases? How can you be really sure that future
changes to the code don't break? And how are you going to measure
performance of your code?

Finally, getting down to some code. Most of the code is in a single
file:

  http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb

or

  http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb

I think it should be broken up. It would be logical to split by type
of elements - at least. I know in BioRuby we are ambiguous about file
sizes - I think a single file should describe one concept. That way
file names become self describing. Files larger than 300 lines tend
to be hard to digest - and probably point out some bigger issue.

Also, when I look at DnaSeqRow, RnaSeqRow and others derived from
SeqRow (line 2148 and onwards in element.rb), I can see duplicated
coding 'patterns'. You are repeating a concept. Would there not be a
more elegant way in Ruby to handle this? Hint: Inheritance is just one
mechanism, I see no real reason to use an inheritance tree. Why not
use one Sequence class for all of these which can contain different
formed elements? I bet the code would become a lot shorter and
(probably) less error prone. Take Ruby's Array container class as an
example - it is just one implementation of a container which allows
many types of elements.

A final comment for this session: The class/method descriptions are
not very informative. It may be early days - especially since we can
see some refactoring coming, but it usually helps to write out
examples giving the 'nicest' interface for people to use. And stick
those in the source code. Personally I favour rubydoctests, see

  http://github.com/tablatom/rubydoctest

I used these in bio/appl/paml/codeml/report.rb - these are examples
that double as tests. Kill two birds with one stone! The BioRuby
tutorial also uses doctests - i.e. the code in the Tutorial can be
validated against the installed bioruby. If you want to use this you
need an extra conversion - I have that tool.

Another possibility is to start using RSpec. 

  http://rspec.info/

I really like RSpec too - it is more of a replacement for unit
tests - and easier to understand, so Specs double as documentation.

I am interested to see what you want to do for RDF support. Maybe you
can write out the API as an RSpec? That would be a good start.

Do not hesitate to stand up to me. You will probably get support from
someone on this list ;)

Pj.

On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote:
> Last week, I worked on finishing implementation of Trees: trees, tree,
> network; and started work on the characters element. This weeks target is to
> complete the implementation of the characters element.
> 
> It would be awesome to have some code review including: implementation, API
> design, coding style and tests. I am planning to give a good amount of time
> in the fourth week in making the code more robust. It would make perfect
> sense to have some feedback to serve as guidelines :). The master branch and
> API discussion page are at:
> 
> [1] http://github.com/yeban/bioruby
> [2]
> https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
> 
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby