[Biopython-dev] upcoming Bio.PDB enhancements - RNA

Tue Jun 1 18:25:52 UTC 2010

On Tue, Jun 1, 2010 at 10:11 AM, Kristian Rother <krother at rubor.de> wrote:

> Hi,
>
> >>> from Bio.Struct import RNA
> >>> # Would this work for you, Kristian?
> >>
> >> Yes, it would be more descriptive than the originally proposed Bio.RNA.
> >> I am just concerned whether I could keep the 2D structure-related
> >> modules in the same package.
> >
> > I don't necessarily see a problem with Bio.Struct or Bio.Structure
> > covering both 2D and 3D structures. Does this 2D stuff include file
> > parsers? That would complicate plans for Bio.Struct.read() etc. Maybe
> > Bio.RNA is better.
>
> Yes, currently, RNA contains 2D stuff. It would complicate Struct.read().
> On the other hand, the 2D stuff is independent from the 3D modules - could
> be split into two packages -- but I think keeping RNA is simpler.
>
> Best Regards,
>    Kristian
>
>
I could be totally wrong here, but I think it's useful to lay out some
assumptions and intuitions explicitly.

To me, secondary structure is not really a separate dimension in its own
right, the way tertiary structure corresponds to 3D space and primary
structure corresponds to a linear sequence. Instead, secondary structure has
meaning in 3D space, but is usually serialized as a linear sequence. That
is, we want to parse something that resembles a sequence, but be able to map
it onto a 3D structure. (More for proteins than for RNA, usually.)

(For non-RNA folk, here's an example of RNA secondary structure:
http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna
)
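
To make the "serialized as a linear sequence" point concrete, here is a rough
sketch that turns a dot-bracket string into base-pair indices. It assumes the
usual Vienna convention of matching parentheses for paired bases and dots for
unpaired ones. The function name parse_dot_bracket and the example string are
made up for illustration; none of this is existing Biopython code.

```python
def parse_dot_bracket(structure):
    """Return sorted 0-based (i, j) base-pair indices for a dot-bracket string.

    Hypothetical helper, for illustration only.
    """
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Made-up hairpin: three paired bases, a three-base loop, three partners.
pairs = parse_dot_bracket("(((...)))")
print(pairs)
```

The pairs are just index tuples into the sequence, so they could equally be
looked up against residues of a 3D structure with the same numbering.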

For instance, the output of DSSP and Jpred describes a protein's secondary
structure, but the input to DSSP is a 3D structure, while Jpred accepts a
protein sequence. The representation of secondary structure isn't distinct
from either the 3D structure or the sequence. I'd want support for both DSSP
and Jpred output available in Bio.Struct (eventually).

This means that some interaction between Bio.Struct and SeqIO is necessary.
It would be neat if secondary structure regions were represented as
SeqFeature instances, and secondary-structure parsers returned some kind of
subclass of SeqRecord -- or a standard SeqRecord containing a special kind
of Seq.
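
A minimal sketch of the "regions as features" idea, using only the standard
library so it stands on its own. The Region class is a hypothetical stand-in
for a SeqFeature (a half-open span plus a type label), and
regions_from_string is likewise a made-up name. The example string assumes
DSSP-style codes (H for helix, E for strand) with "-" marking residues that
have no assignment.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a SeqFeature: half-open span [start, end)."""
    start: int
    end: int
    kind: str


def regions_from_string(annotation, skip="-"):
    """Collapse a per-residue annotation string into contiguous regions.

    Symbols listed in `skip` are treated as "no feature" and dropped.
    """
    regions = []
    pos = 0
    for symbol, run in groupby(annotation):
        length = len(list(run))
        if symbol not in skip:
            regions.append(Region(pos, pos + length, symbol))
        pos += length
    return regions


# Made-up per-residue secondary structure string.
for region in regions_from_string("--HHHH--EEE-"):
    print(region)
```

With real Biopython objects, each Region would presumably become a SeqFeature
attached to the record's features list, but that mapping is an assumption
here rather than a worked-out design.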

The secondary-structure parsers for RNA and proteins should be separate,
too, since the annotated features are different. So the function
Bio.Struct.read() can apply exclusively to 3D structures. Would it be
reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary
structures -- assuming that anything that's not a secondary structure, 3D
structure, or nucleotide sequence is something special that belongs in its
own module?

As for protein secondary structure, it's usually associated with a sequence
or a structure, so maybe we could get by with storing that information in an
ordinary Structure or SeqRecord object without inventing a new subclass.

Best,
Eric