[BioRuby] Alignment plugin

Pjotr Prins pjotr.public14 at thebird.nl
Mon Apr 26 16:30:55 UTC 2010


Hi Rutger,

On Mon, Apr 26, 2010 at 04:40:11PM +0100, Rutger Vos wrote:
> On Mon, Apr 26, 2010 at 4:04 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:
> > Maybe we should start defining a basic sequence object. What would we
> > want from it, what should be core and what should be mixed in?
> >
> > Alignments and secondary structures should build on that.
> 
> In the interest of learning from other Bio* projects ;-) it should be
> noted that there is a bit of a mismatch between sequences as
> standalone objects on the one hand, and rows within character state
> matrices on the other, especially when you consider types of data
> beyond molecular sequences (e.g. morphological character state data).

Yes.

> Within a matrix there are columns such that every cell in a sequence
> now becomes a concrete instance of one of a limited set of character
> states for that character/column. Especially for morphological data
> there could be very esoteric ambiguity mappings from one state in that
> column to another. Imagine an alignment with unique mappings a la the
> IUPAC single character codings for each column. The upshot might be
> that you'd need a mapping object for each cell, though you'd use an
> immutable class for molecular data.

I think I understand what you mean here. The way I see it is that the
sequences are immutable lists of nucleotides/amino acids. State can
be at row, column or individual matrix point level.

I guess we cannot dictate how people use the data structure. Either
they keep state as a loose component (possibly a matrix) projected
onto the sequences, or (if our format allows it) they maintain state
at each of the three levels (row, column, point).
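
A minimal sketch of the first option, with state kept as a loose
component projected onto frozen sequences. The AnnotationOverlay class
and its method names are hypothetical, made up for illustration; they
are not existing BioRuby API. Rows and columns are assumed to be
0-based, like Ruby arrays.

```ruby
# Hypothetical sketch: sequences stay immutable, annotations live in a
# separate overlay keyed by row, column, or (row, column) point.
class AnnotationOverlay
  def initialize
    @rows    = Hash.new { |h, k| h[k] = {} }
    @columns = Hash.new { |h, k| h[k] = {} }
    @points  = Hash.new { |h, k| h[k] = {} }
  end

  def annotate_row(row, attrs)
    @rows[row].merge!(attrs)
    self
  end

  def annotate_column(col, attrs)
    @columns[col].merge!(attrs)
    self
  end

  def annotate_point(row, col, attrs)
    @points[[row, col]].merge!(attrs)
    self
  end

  # Everything known about one cell. More specific levels override
  # less specific ones (point over column over row).
  def state_at(row, col)
    state = {}
    state = state.merge(@rows.fetch(row, {}))
    state = state.merge(@columns.fetch(col, {}))
    state.merge(@points.fetch([row, col], {}))
  end
end

# The sequence data itself is never touched.
sequences = ["ACGTACGT", "ACGAACGT", "ACG-ACGT"].map(&:freeze).freeze

overlay = AnnotationOverlay.new
overlay.annotate_row(2, :source => "sample B")
overlay.annotate_column(3, :homology => :low)
overlay.annotate_point(2, 3, :gap => true)

p overlay.state_at(2, 3)
```

The precedence order in state_at is a design choice of this sketch
only; a real implementation might prefer to keep the three levels
separate instead of merging them.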

In my case I would like to add state into the data structure (one
advantage could be that it would be relatively easy to export, also to
RDF).  We have an alignment:

  aln = Alignment.new(sequences)

I would like to annotate columns 4..6 as having high homology

  aln.column(4..6, :homology=>HIGH)

maybe I want to remove a part of sequence 3 and mark it as such

  aln.delete(3, 20..30)
  aln.sequence(3, :position=>20..30, :deleted=>true)

or indicate an ORF

  aln.sequence(3, :position=>40..65, :orf=>true)
 
and fetch information, like quality scores

  sequence = aln.sequence(3)
  quality = sequence.quality(:position=>40..65)

Or any variation thereof. State would be maintained inside
Alignment(Column), Sequence or Nucleotide/Aminoacid.
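
The calls above could be backed by something like the following
sketch, where state is stored inside the alignment (per column) and
inside each sequence (per position range). All class and method
bodies here are assumptions for illustration, not existing BioRuby
code. Indices are assumed 0-based, and the sketch records a deletion
as an annotation rather than mutating the residues.

```ruby
# Hypothetical sketch of an alignment that carries its own state.
class AnnotatedSequence
  attr_reader :residues, :features

  def initialize(residues, qualities = nil)
    @residues  = residues.dup.freeze # the sequence itself is immutable
    @qualities = qualities           # optional per-position scores
    @features  = []
  end

  # Record a feature such as :position=>2..5, :orf=>true
  def annotate(attrs)
    @features << attrs.dup
    self
  end

  # Features whose :position range overlaps the given range.
  def features_in(range)
    @features.select { |f| f[:position] && overlaps?(f[:position], range) }
  end

  # Quality scores, optionally restricted to :position=>range.
  def quality(opts = {})
    return nil unless @qualities
    opts[:position] ? @qualities[opts[:position]] : @qualities
  end

  private

  def overlaps?(a, b)
    a.first <= b.last && b.first <= a.last
  end
end

class Alignment
  def initialize(sequences)
    @sequences = sequences.map do |s|
      s.is_a?(AnnotatedSequence) ? s : AnnotatedSequence.new(s)
    end
    @column_state = Hash.new { |h, k| h[k] = {} }
  end

  # With attrs: annotate every column in the range.
  # Always returns the state of the columns in the range.
  def column(range, attrs = nil)
    range.each { |i| @column_state[i].merge!(attrs) } if attrs
    range.map { |i| @column_state.fetch(i, {}) }
  end

  # With attrs: annotate the sequence. Always returns the sequence.
  def sequence(index, attrs = nil)
    seq = @sequences.fetch(index)
    seq.annotate(attrs) if attrs
    seq
  end
end

aln = Alignment.new([
  AnnotatedSequence.new("ACGTACGTAC", [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]),
  "ACGAACGTAC"
])

aln.column(4..6, :homology => :high)
aln.sequence(0, :position => 2..5, :orf => true)

sequence = aln.sequence(0)
quality  = sequence.quality(:position => 2..5)
p quality # [32, 33, 34, 35]
```

Keeping the residues frozen and storing deletions, ORFs and similar
marks as features means the original data can always be recovered,
which should also make an export (for example to RDF) a matter of
walking the stored attributes.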

Pj.


