[BioRuby] Alignment plugin

Mon Apr 26 17:25:31 UTC 2010

What you describe below is not what I meant, though it's also very
important w.r.t. preserving the provenance of annotations. We're
thinking in a number of different directions and so the requirements
are starting to creep in :-)

What I meant is, by long-winded example, the following: imagine you're
studying the phylogeny of lemurs, and you want to look at
morphological and behavioral characters. Here's what a character state
matrix might look like:

D._madagascariensis  -  0
H._aureus            4  1
H._simus             6  ?
H._griseus           ?  1

The first column captures the number of teeth in the lower-jaw
toothcomb. Some lemurs use the incisors of the lower jaw as a grooming
apparatus, and they have (I believe) either 4 or 6 teeth in that
"comb". D._madagascariensis does not have this apparatus at all, so
its state for this column could be coded as "-", conceptually a bit
like a gap in an alignment, interpreted as "does not apply". We simply
have no data for H._griseus, so we code it as "?", meaning "missing".

The second column captures activity pattern, such that "0" means
"nocturnal", and "1" means "diurnal". You can imagine that we might
not know when H._simus is active, so a state "?" could be valid for
this column, but a state "-" definitely isn't: the animals are either
nocturnal or diurnal (or we don't know exactly which one of the two
applies).
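
To make the per-column rules concrete, here is a rough Ruby sketch of the
matrix above. All names in it (MATRIX, ALLOWED, invalid_cells) are made up
for illustration and are not part of BioRuby or any other toolkit; it only
shows that "-" is acceptable in the first column but not in the second.

```ruby
# Hypothetical names throughout; this is not BioRuby API.
MISSING      = "?"  # we have no data
INAPPLICABLE = "-"  # the character does not apply to this taxon

# Rows are taxa; columns are (toothcomb tooth count, activity pattern).
MATRIX = {
  "D._madagascariensis" => ["-", "0"],
  "H._aureus"           => ["4", "1"],
  "H._simus"            => ["6", "?"],
  "H._griseus"          => ["?", "1"],
}

# Allowed state symbols, one list per column.
ALLOWED = [
  ["4", "6", MISSING, INAPPLICABLE],  # toothcomb: may be absent altogether
  ["0", "1", MISSING],                # activity: "does not apply" makes no sense
]

# Returns [taxon, column, state] for every cell whose symbol is not
# allowed in its column.
def invalid_cells(matrix, allowed)
  matrix.flat_map do |taxon, states|
    states.each_with_index
          .reject { |state, column| allowed[column].include?(state) }
          .map    { |state, column| [taxon, column, state] }
  end
end

invalid_cells(MATRIX, ALLOWED)                   # => []
invalid_cells({ "X" => ["4", "-"] }, ALLOWED)    # => [["X", 1, "-"]]
```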

To some extent, a matrix with such characters would be like an
alignment, and in many cases you would analyze this data using the
same tools for phylogenetic inference, like paup, phylip, mrbayes,
etc. Also, the same data formats (nexus/nexml, phylip) describe both
these matrices and alignments.

So it would make sense to implement them as objects within the same
class hierarchy, and the projects where I've looked at the insides
(Bio::Phylo, Mesquite, DendroPy, CIPRES, JEBL) all do this, though not
all in the same way. BioPerl does not really do this in that it has no
explicit concept of categorical character state matrices beyond
molecular ones. It's hard to see how something like this could be
retrofitted elegantly into BioPerl, which is why I am ringing the
alarm bells now :-)

The problem that needs to be solved is to come up with a way to
describe, for each column, which state symbols are allowable (and
potentially to annotate them) without creating a baroque beast, or at
least with a beast that stays in its cage for the 90% of the time when
we're dealing with molecular data, where all columns have the same
semantics and we have no further annotations per column.

The way I've dealt with this in the past is to create an object that
has a map where every key is a state symbol, and the values are lists
of zero or more other states that the symbol can stand for (e.g. N maps
onto A, C, G, T, but "-" maps onto an empty list). In extreme cases,
such as the morphological matrix I described, you would have one such
object attached to every column in the matrix. But for molecular data
the object would be a singleton for the whole alignment.
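
A minimal Ruby sketch of that idea follows. The class and method names
(StateSet, CharacterMatrix, states_for) are hypothetical and not taken from
BioRuby or Bio::Phylo, and letting a fundamental state such as "A" map onto
itself is an assumption made here to keep lookups uniform.

```ruby
# Hypothetical classes; names are illustrative, not an existing API.
class StateSet
  # map: state symbol => list of fundamental states it can stand for
  def initialize(map)
    @map = map
  end

  def valid?(symbol)
    @map.key?(symbol)
  end

  # Raises KeyError for a symbol that is not allowed at all.
  def resolve(symbol)
    @map.fetch(symbol)
  end
end

DNA = StateSet.new({
  "A" => ["A"], "C" => ["C"], "G" => ["G"], "T" => ["T"],  # assumption: map onto self
  "N" => ["A", "C", "G", "T"],                             # ambiguity code
  "-" => [],                                               # gap: no state
})
TEETH    = StateSet.new({ "4" => ["4"], "6" => ["6"], "?" => ["4", "6"], "-" => [] })
ACTIVITY = StateSet.new({ "0" => ["0"], "1" => ["1"], "?" => ["0", "1"] })

class CharacterMatrix
  # default:    one StateSet shared by all columns (the molecular case)
  # per_column: column index => StateSet (the morphological case)
  def initialize(default: nil, per_column: {})
    @default    = default
    @per_column = per_column
  end

  def states_for(column)
    @per_column.fetch(column, @default)
  end
end

alignment  = CharacterMatrix.new(default: DNA)
morphology = CharacterMatrix.new(per_column: { 0 => TEETH, 1 => ACTIVITY })

alignment.states_for(0).equal?(alignment.states_for(99))  # => true, one shared object
morphology.states_for(1).valid?("-")                      # => false
```

The molecular case pays for only one shared object, while the morphological
case can attach a different one to each column.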

If you buy this line of thinking (YMMV), you might agree that a single
sequence may need complicated helper objects and coordinate systems to
keep track of the sort of mapping semantics that come into play once
the sequence becomes homologized with others as building blocks for
alignments/matrices.
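
One simple instance of such a coordinate system is the mapping between
positions in the raw (ungapped) sequence and columns in the aligned row.
The sketch below is only an illustration under that assumption; GapMap and
its methods are hypothetical names, and positions are 0-based.

```ruby
# Hypothetical helper, not BioRuby API.
class GapMap
  def initialize(aligned_row, gap = "-")
    # Index = position in the ungapped sequence, value = alignment column.
    @columns = []
    aligned_row.each_char.with_index do |char, column|
      @columns << column unless char == gap
    end
  end

  # Alignment column holding the residue at this sequence position.
  def column_for(position)
    @columns.fetch(position)
  end

  # Sequence position found in this alignment column, or nil for a gap.
  def position_for(column)
    @columns.index(column)
  end
end

map = GapMap.new("AC--GT")
map.column_for(2)    # => 4   (the G)
map.position_for(4)  # => 2
map.position_for(3)  # => nil (a gap)
```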

I hope all this makes some amount of sense :-)

Rutger

> I think I understand what you mean here. The way I see it is that the
> sequences are immutable lists of nucleotides/amino acids. State can
> be at row, column or individual matrix point level.
>
> I guess it is impossible to impose the way people want to use the data
> structure. Either they use state as a loose component (could be a
> matrix) projected on the sequences, or (if our format allows it) they
> could maintain state at each of the three levels (row, column, point).
>
> In my case I would like to add state into the data structure (one
> advantage could be that it would be relatively easy to export, also to
> RDF).  We have an alignment:
>
>  aln = Alignment.new(sequences)
>
> I would like to annotate columns 4..6 as having high homology
>
>  aln.column(4..6, :homology=>HIGH)
>
> maybe I want to remove a part of sequence 3 and mark it as such
>
>  aln.delete(3, 20..30)
>  aln.sequence(3, :position=>20..30, :deleted=>TRUE)
>
> or indicate an ORF
>
>  aln.sequence(3, :position=>40..65, :orf=>TRUE)
>
> and fetch information, like quality scores
>
>  sequence = aln.sequence(3)
>  quality = sequence.quality(:position=>40..65)
>
> Or any variations thereof. State would be maintained inside
> Alignment(Column), Sequence or Nucleotide/Aminoacid.
>
> Pj.
>

-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com