[BioRuby] [GSoC][NeXML and RDF API] Sequences( doubts )

Sat Jul 3 14:43:32 UTC 2010

On Sat, Jul 03, 2010 at 04:43:43PM +0530, Anurag Priyam wrote:
> This is going to be a long mail.
> 
> NeXML's characters tag serves as a storage block for sequences. Sequences
> can be described in NeXML in two ways: raw (with the seq tag) and granular
> (with the cell tags). NeXML offers six kinds of sequences:
> 1. Protein( AA )
> 2. DNA
> 3. RNA
> 4. Restriction
> 5. Standard
> 6. Continuous

How do these sequences differ? In name only? Can you store them as
tuples:

(:dna,sequence)
(:rna,sequence)
(:re,sequence)
etc.

You could argue for a new SequenceType object to store type +
sequence.
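
As a rough sketch of both options in plain Ruby (the names SequenceType,
to_records and TAGGED, and the sample values, are illustrative assumptions,
not part of BioRuby or any NeXML library):

```ruby
# Illustrative sketch only; nothing here is an existing BioRuby or NeXML API.

# Option 1: plain tagged pairs, i.e. two-element arrays of [type, sequence].
TAGGED = [
  [:dna, "agtct"],
  [:rna, "agucu"],
  [:re,  "0101"] # made-up value for a restriction-type sequence
].freeze

# Option 2: a small value object that stores type + sequence.
SequenceType = Struct.new(:type, :sequence)

# Convert tagged pairs into SequenceType records (hypothetical helper).
def to_records(pairs)
  pairs.map { |type, seq| SequenceType.new(type, seq) }
end

to_records(TAGGED).each do |rec|
  puts "#{rec.type}: #{rec.sequence}"
end
```

Either form keeps the sequence itself a plain String; the Struct only adds
named accessors on top of the pair.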

> As of now, the NeXML parser just returns the sequence as a string. It should
> return Bio::Sequence. BioRuby already has classes to work with AA and NA
> sequences. I was thinking of adding classes to represent Restriction,
> Standard and Continuous sequences. Should I work on adding support for these
> as core BioRuby classes or just as part of the NeXML lib? I will have to
> adapt Bio::Sequence class to recognize the new sequences.

I think your library needs to return the simplest type possible, even
plain standard Ruby containers (which are simpler still than BioRuby's
types). That makes for the most flexible implementation for others to
use. BioRuby's types may change in the future too - I am working on that.

Your library is not really in the business of creating new types -
unless you create new functionality - like an alignment algorithm, or
some transformation to a new type. 

Better keep it simple.

If I have a NeXML file containing an alignment of sequences, I
expect simply to pull out those sequences with their IDs. Right?

You could return a BioRuby Alignment object, but that is overkill. I
can build one myself, or I may want to use my own MyAlignment type.

What I really want is a list of (id, list[nucleotide]) or (id, String)
in BioRuby's case, if that is what is stored in NeXML.

In pseudocode:

  seqlist = NeXML.read(fn).fetch_alignment
  p seqlist.first
  # => ["id", "agtct"]

or in the form of an iterator

  NeXML.read(fn).fetch_alignment.each_seq do |id, seq|
    # do something with id and seq
  end

and likewise use cases for other scenarios.
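
To make the iterator idea concrete, here is a minimal sketch backed by
nothing more than an Array of [id, String] pairs. SimpleAlignment is a
hypothetical name chosen for illustration; it is not an existing BioRuby or
NeXML class, and a real parser would fill the rows from the file:

```ruby
# Hypothetical sketch: SimpleAlignment is an illustrative name, not an
# existing BioRuby or NeXML class.
class SimpleAlignment
  # rows: an Array of [id, sequence_string] pairs
  def initialize(rows)
    @rows = rows
  end

  # Yields each id and sequence in order.
  # Returns an Enumerator when called without a block.
  def each_seq
    return enum_for(:each_seq) unless block_given?

    @rows.each { |id, seq| yield id, seq }
    self
  end

  # The first [id, sequence] pair, or nil if the alignment is empty.
  def first
    @rows.first
  end
end

alignment = SimpleAlignment.new([["id1", "agtct"], ["id2", "agtca"]])

alignment.each_seq do |id, seq|
  puts "#{id}\t#{seq}"
end
```

The caller only ever sees Strings and Arrays, so wrapping the result in an
alignment type of their own choosing stays trivial.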

For RDF the use cases are similar, I would guess.

  NeXML.read(fn).fetch_alignment.to_rdf

Keep it simple, again. The thing is that most people overcomplicate
things in OOP. All, and I mean all, Bio* projects overcomplicate
things.

> Why does the Bio::Sequence#guess method use a roughly 90% threshold to
> distinguish between AA and NA? Why not use regexp instead?

I am not a great fan of guessing formats. It is always error-prone.
Both amino acid sequences and nucleotide sequences can consist of a
combination of shared letters (A, C, G and T are all valid amino acid
codes as well).

Still, I guess regexes are slower. Feel free to come up with an
alternative and measure how well it does. But I have trouble seeing
why you need it.
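
If you do want to measure, something along these lines would be a start.
Both helpers are made up for this mail and neither is BioRuby's actual
implementation; the 0.9 threshold and the a/c/g/t/u/n alphabet are
assumptions for the sake of the comparison:

```ruby
require "benchmark"

# Hypothetical helpers for illustration; neither is BioRuby's actual code.

# Fraction-based guess: call it nucleic acid (:na) when the share of
# a/c/g/t/u/n characters exceeds the threshold, otherwise amino acid (:aa).
def guess_by_fraction(seq, threshold = 0.9)
  na_count = seq.downcase.count("acgtun")
  na_count.to_f / seq.length > threshold ? :na : :aa
end

# Regex-based guess: :na only if every character is a/c/g/t/u/n.
def guess_by_regex(seq)
  seq.match?(/\A[acgtun]+\z/i) ? :na : :aa
end

seq = "agtc" * 2_500 # synthetic 10,000-character test sequence

Benchmark.bm(10) do |bm|
  bm.report("fraction:") { 1_000.times { guess_by_fraction(seq) } }
  bm.report("regex:")    { 1_000.times { guess_by_regex(seq) } }
end
```

Note the two do not agree on every input: a nucleotide sequence with a
single ambiguity code outside the assumed alphabet still passes the
fraction test but fails the strict regex, so speed is not the only thing
to compare.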

Pj.