[BioRuby] Alignment plugin

Pjotr Prins pjotr.public14 at thebird.nl
Mon Apr 26 12:53:47 UTC 2010


I am thinking of creating some new infrastructure for alignments.

The BioRuby alignment architecture is not great. It contains a lot of useful
functionality, but it is purely sequence-organized. I did a writeup on the
BioRuby blog - on ALN support and colorized HTML - if you remember.

For completeness I checked the BioJAVA and BioPython implementations. 

The BioJAVA alignment classes are in a deep tree:

  biojava/alignment/src/main/java/org/biojava/bio/alignment

The implementation troubles me. Partly it is JAVA itself - which makes code
feel dispersed. Partly it is the implementation, which appears to be minimal. I
guess it is a work in progress.

The BioPython version looks like it is the best of the three. Some
separation of responsibilities. Good documentation, and good
validation and testing. I like that. Otherwise, functionally it is
mostly comparable to BioRuby.

The trick of designing good alignment classes is to make them small and farm
out responsibilities. The BioJAVA version does not contain much. The BioRuby
version has everything in one place, including the kitchen sink. BioPython goes
some way towards what it should be, but it does not look more
extensible than what we have (and I don't want to use Python).

It sucks. I don't feel like replicating all other code. At the same time I want
something cleaner. 

The PAML output adds information for each column of an alignment.
Besides, we deal with the translated alignment too. So PAML requires a
dual alignment standard (NU+AA) with column-wise information (homology,
evidence of positive selection). Add to that the phylogenetic tree. For
my current work I am going to add column-wise and row-wise 'meta'
information, which is used for output (both HTML and graphics).
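
To make the dual alignment idea concrete, here is a rough sketch. All
names (AnnotatedAlignment, DualAlignment, the meta keys) are made up for
illustration and are not existing BioRuby classes. It assumes the
nucleotide alignment is codon-aligned, so that three NU columns map to
one AA column. Rows are plain strings here only to keep the example
short.

```ruby
# Hypothetical sketch: none of these names exist in BioRuby.
class AnnotatedAlignment
  attr_reader :rows, :row_meta, :column_meta

  # rows: Hash of id => aligned sequence (all rows the same length)
  def initialize(rows)
    lengths = rows.values.map(&:length).uniq
    raise ArgumentError, 'rows differ in length' if lengths.size > 1

    @rows = rows
    @row_meta = Hash.new { |h, id| h[id] = {} }         # one Hash per row id
    @column_meta = Array.new(lengths.first || 0) { {} } # one Hash per column
  end

  def columns
    @column_meta.size
  end
end

# Pairs a codon-aligned nucleotide alignment with its translation.
class DualAlignment
  attr_reader :nu, :aa

  def initialize(nu, aa)
    unless nu.columns == aa.columns * 3
      raise ArgumentError, 'NU must have three columns per AA column'
    end

    @nu = nu
    @aa = aa
  end

  # Amino acid column covering a given nucleotide column (0-based).
  def aa_column(nu_col)
    nu_col / 3
  end

  # Nucleotide columns that make up a given amino acid column.
  def nu_columns(aa_col)
    (aa_col * 3)...(aa_col * 3 + 3)
  end
end

nu = AnnotatedAlignment.new('seq1' => 'ATGGCT---', 'seq2' => 'ATGGCAAAA')
aa = AnnotatedAlignment.new('seq1' => 'MA-', 'seq2' => 'MAK')
dual = DualAlignment.new(nu, aa)

aa.column_meta[1][:positive_selection] = true # made-up value, not PAML output
aa.row_meta['seq1'][:label] = 'outgroup'      # made-up row annotation
```

The point is only that column-wise and row-wise information sits next
to the sequences, so HTML or graphics output can read it without
touching the sequence data.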

I guess the best option is to write two BioRuby plugins. One for the
new alignment storage and one for PAML alignments, which will include
meta-info and output functionality. Questions:

* What is the best way to store alignments - should gaps be represented as dashes?
* Should we use a String format?  
* How do we handle multi-value fields (e.g. degenerates)?
* How do we handle quality scores (sequencers)?

I think the underlying storage format should not be String - as it allows
toying with the data - say, by embedding HTML. Properties, like
colors, should be added on top of the alignment structure, not within.
We should also allow for (future) stronger type checking of
nucleotides and amino acids.
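
A minimal sketch of what I mean, with hypothetical names (Element,
ElementRow) that are not part of BioRuby. Each position is a small
object that can carry a quality score, the gap is still a dash
internally, and colors are kept in a separate structure that is only
combined with the row at output time. Degenerates are not solved here,
other than that a single IUPAC character fits in one element.

```ruby
# Hypothetical sketch; the class and method names are illustrative only.
class Element
  GAP = '-'
  attr_reader :symbol, :quality

  def initialize(symbol, quality = nil)
    @symbol = symbol.to_s
    unless @symbol.length == 1
      raise ArgumentError, "expected one character, got #{symbol.inspect}"
    end

    @quality = quality # optional sequencer quality score
  end

  def gap?
    @symbol == GAP
  end

  def to_s
    @symbol
  end
end

class ElementRow
  include Enumerable

  # string: aligned sequence; qualities: optional per-position scores
  def initialize(string, qualities = [])
    @elements = string.each_char.with_index.map do |c, i|
      Element.new(c, qualities[i])
    end
  end

  def each(&block)
    @elements.each(&block)
  end

  def [](index)
    @elements[index]
  end

  # Plain string form, e.g. for handing over to string-based code.
  def to_s
    @elements.join
  end
end

# Presentation properties live beside the row, keyed by column index.
row = ElementRow.new('AT-G', [30, 32, nil, 40])
colors = { 2 => 'grey' }
html = row.each_with_index.map do |e, i|
  colors[i] ? "<span style=\"color:#{colors[i]}\">#{e}</span>" : e.to_s
end.join
```

Because the row never stores markup, row.to_s always gives back the
bare alignment string, whatever the output layer does with it.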

If we can convert easily to the standard BioRuby alignment, old
functionality can be retained, though the conversion may not always
be that natural.

With Ruby a string type may be the most obvious choice (a list of
lists of a special nucleotide object is probably overkill, though it
should not be).

Anyone interested in participating?

With regard to plugins: for now I will merely create a separate

  pluginname/lib/bio/pluginname.rb

and add that to the include path. That should be OK for now. It will
allow adding it as a gem too.
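
For example, something along these lines would do, where 'pluginname'
is just the placeholder from the path above and the plugin directory is
assumed to sit next to the calling script:

```ruby
# 'pluginname' is the placeholder from the path above, not a real plugin.
plugin_lib = File.expand_path('pluginname/lib', __dir__)
$LOAD_PATH.unshift(plugin_lib) unless $LOAD_PATH.include?(plugin_lib)

# With the directory on the load path, the plugin would be loaded with:
#   require 'bio/pluginname'
```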

Pj.


