[Bioperl-l] RNA fold

Chris Fields cjfields at uiuc.edu
Tue Dec 9 15:03:08 EST 2003


On Tue, 2003-12-09 at 11:12, Sam Griffiths-Jones wrote:
> On Tue, 9 Dec 2003, Chris Fields wrote:
> 
> > I think that you can use parenthetical formats for pseudoknot-like
> > structures (improperly nested Watson-Crick helices).  The idea is that
> > () would represent secondary structure, and other brackets {}[] would
> > represent higher-order structures, like so:
> >
> >      Helix         Pseudoknot
> > ______________       _______________
> > |            |       |             |
> > (((((....))))).[[[...((((..]]]..))))
> >                 |___________|
> >
> 
> Ah, if you're thinking of a generic format then different brackets are
> going to get you into trouble - Sean Eddy's INFERNAL suite (think
> HMMer for RNAs) already uses different brackets to mark up different
> layers of nesting, like:
> 
> ..[[[[..<<<<<..>>>>>....<<<..>>>..]]]]..
> 
> There is an informal standard to incorporate pseudoknot info into the
> bracket notation using letters for non-nested base pairs:
> 
> <<<<<<<.<<<...AAAA..>>>>>>>>>>..aaaa......
> 
> The upper case stuff base pairs with the lower case stuff.  This seems
> like a really bad idea, but given that you're parsing the vast
> majority of the structure with brackets, and the most complicated
> known nested pseudoknot (in the alpha operon leader) only involves
> letters A, B and C, it's not so bad.  Also this provides a natural
> separation for the algorithms that can only deal with nested
> interactions (SCFGs and the like) from those that can use everything.
> For what it's worth this is how we mark up such non-nested things in the
> Rfam database.

Yikes!  This is a problem, b/c I have seen many different ways of
showing secondary structure (CT table format, parenthetical, XML, etc).
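
To make the bracket and letter conventions quoted above concrete, here
is a minimal sketch of a parser that keeps one stack per symbol type.
It is in Python rather than Perl purely for illustration, and the names
parse_structure and has_crossing are hypothetical, not part of Bioperl,
INFERNAL or Rfam.  It assumes that an upper-case letter always opens a
pair and the matching lower-case letter closes it.

```python
from collections import defaultdict

# Opening symbol -> closing symbol for the bracket types seen in this thread.
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = {close: open_ for open_, close in BRACKETS.items()}


def parse_structure(structure):
    """Return a sorted list of (i, j, symbol) base pairs, 0-based.

    Brackets pair with their matching closer; an upper-case letter pairs
    with the same letter in lower case (assumed pseudoknot convention).
    Any other character ('.', '-', ...) is treated as unpaired.
    """
    stacks = defaultdict(list)  # one stack of open positions per symbol
    pairs = []
    for pos, char in enumerate(structure):
        if char in BRACKETS or char.isupper():
            stacks[char].append(pos)
        elif char in CLOSERS or char.islower():
            opener = CLOSERS.get(char, char.upper())
            if not stacks[opener]:
                raise ValueError(f"unmatched {char!r} at position {pos}")
            pairs.append((stacks[opener].pop(), pos, opener))
    leftover = [sym for sym, stack in stacks.items() if stack]
    if leftover:
        raise ValueError(f"unclosed symbols: {leftover}")
    return sorted(pairs)


def has_crossing(pairs):
    """True if any two pairs interleave (i < k < j < l), i.e. are not nested."""
    return any(i < k < j < l for i, j, _ in pairs for k, l, _ in pairs)


if __name__ == "__main__":
    for line in ("(((((....))))).[[[...((((..]]]..))))",
                 "<<<<<<<.<<<...AAAA..>>>>>>>>>>..aaaa......"):
        pairs = parse_structure(line)
        print(len(pairs), "pairs, crossing:", has_crossing(pairs))
```

Because each symbol gets its own stack, the same routine reads both
example lines from this thread; only the meaning attached to each symbol
(nesting layer versus pseudoknot) differs between the conventions.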

> > Of course this is where the problem lies, b/c all structures in this
> > format are constrained to simple 1:1 base associations, such as simple
> > Watson-Crick base pairs or noncanonical base pairs (A-G, G-U, etc).
> > Some higher order structures, like triple-helices (A:U:U) and quaternary
> > helices (G:G:G:G) can't be accounted for.  Also, the parenthetical
> > syntax gets a bit confusing for very large sequences (16S rRNA, for
> > instance).
> >
> 
> Yep - tough in a single line.  We've also been thinking about how to
> mark these up in alignments of RNAs in Rfam, but without decision.
> You might think of things which aren't 1:1 as tertiary interactions
> and therefore separable from the secondary structure which the bracket
> notation is designed to cope with.
> 
> > I think that the format all really depends on the program and the
> > particular use.
> <snip>
> > After all this babbling, I do think that RNAML is the way to go with
> > this.
> 
> These two seem contradictory to me :)

A bit contradictory, yes.  I also tend to write as a stream of thought,
so I sometimes change my mind.  

I'm a bit confused as to how to approach the original issue (tagging the
structure in some way for a plugin).  This is b/c there doesn't seem to
be a consensus yet on an approach to retain as much structural
information as possible.  RNAML seems to be the best way so far (and the
list of people on board is pretty impressive), but it's a bit complex.  

Personally, I think the best way to handle multiple formats is the
approach used by Bio::SeqIO: a specific parser for each format (INFERNAL,
CT, RNAML, etc) reads everything into one common representation, either
a Bioperl-specific one or an existing format that retains information at
the highest possible level (RNAML, INFERNAL, etc).  From there the data
could be written out in any of the alternative formats, which may or may
not retain the higher-level information depending on the input and
output formats, possibly with a warning when information is lost (RNAML
structure format to CT, for instance).  I may have to delve into
Bio::SeqIO a bit to get an idea of how it handles things.
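
A rough sketch of that parser/writer arrangement, again in Python for
illustration only.  Everything here is an assumption: the Structure
class, the PARSERS/WRITERS tables and the convert function are
hypothetical names, not how Bio::SeqIO is actually implemented, and the
CT writer only approximates the usual six-column layout.

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class Structure:
    """Hypothetical common in-memory form: sequence plus tagged pairs."""
    sequence: str
    pairs: list = field(default_factory=list)  # (i, j, kind), 0-based


def parse_dotbracket(text):
    """Read two lines: the sequence, then a ()/[] bracket string."""
    sequence, brackets = text.split()
    closers = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}
    pairs = []
    for pos, char in enumerate(brackets):
        if char in stacks:
            stacks[char].append(pos)
        elif char in closers:
            kind = "nested" if char == ")" else "pseudoknot"
            pairs.append((stacks[closers[char]].pop(), pos, kind))
    return Structure(sequence, sorted(pairs))


def write_dotbracket(structure):
    """Write plain () notation; other pairs are dropped with a warning."""
    line = ["."] * len(structure.sequence)
    lost = 0
    for i, j, kind in structure.pairs:
        if kind == "nested":
            line[i], line[j] = "(", ")"
        else:
            lost += 1
    if lost:
        warnings.warn(f"{lost} non-nested pair(s) dropped on output")
    return f"{structure.sequence}\n{''.join(line)}\n"


def write_ct(structure):
    """Write a CT-style table: index, base, prev, next, partner, index."""
    partner = {}
    for i, j, _kind in structure.pairs:
        partner[i], partner[j] = j + 1, i + 1
    n = len(structure.sequence)
    rows = [f"{n} converted"]
    for k, base in enumerate(structure.sequence):
        nxt = k + 2 if k + 1 < n else 0
        rows.append(f"{k + 1} {base} {k} {nxt} {partner.get(k, 0)} {k + 1}")
    return "\n".join(rows) + "\n"


# One entry per supported format, looked up by name.
PARSERS = {"dotbracket": parse_dotbracket}
WRITERS = {"dotbracket": write_dotbracket, "ct": write_ct}


def convert(text, from_format, to_format):
    """Parse into the common form, then write in the requested format."""
    return WRITERS[to_format](PARSERS[from_format](text))


if __name__ == "__main__":
    print(convert("GGGAAACCCUUU\n(((.[[)))]].\n", "dotbracket", "ct"))
```

Adding a format then means adding one parser and one writer to the
tables, and each writer decides for itself what it cannot express and
warns about it.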

I guess the real issue is coming up with a way to deal with all levels
of information (secondary, tertiary, etc).  Maybe a modified CT format,
something like a table consisting of bases and their interactions with
other bases?  Maybe with a tag on each interaction saying whether it
is a non-WC pair, part of a triplet, and so on?
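
Something like the following is what I have in mind for such a table,
sketched in Python for illustration.  The function names and the tags
("WC", "triple") are made up here and are not an existing format; each
base simply lists every partner it touches together with a tag, so a
base can have more than one partner.

```python
from collections import defaultdict


def interaction_table(sequence, interactions):
    """Build rows of (index, base, [(partner_index, tag), ...]), 1-based.

    `interactions` is a list of (members, tag), where members is a tuple
    of 0-based positions that contact each other, e.g. ((0, 8), "WC")
    for a pair or ((2, 5, 9), "triple") for a base triple.
    """
    partners = defaultdict(list)
    for members, tag in interactions:
        for pos in members:
            for other in members:
                if other != pos:
                    partners[pos].append((other + 1, tag))
    return [(k + 1, base, sorted(partners[k]))
            for k, base in enumerate(sequence)]


def format_table(rows):
    """One tab-separated line per base; '0' marks a base with no partner."""
    lines = []
    for index, base, partners in rows:
        cell = ",".join(f"{p}:{tag}" for p, tag in partners) or "0"
        lines.append(f"{index}\t{base}\t{cell}")
    return "\n".join(lines)


if __name__ == "__main__":
    rows = interaction_table("AUUGC", [((0, 1, 2), "triple"),
                                       ((3, 4), "WC")])
    print(format_table(rows))
```

A plain CT file falls out as the special case where every base has at
most one partner and the tag column is ignored.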

Anyway, for now I plan on just getting my motif results into a simple
parenthetical format which I'll parse in a program outside of
RNAmotif.pm.  I could always change it to another format later, when
some of the formatting issues are resolved.

> I don't know much about RNAML but I get the impression it's trying to
> solve all RNA sequence/markup/annotation issues in one go.  Depending
> on your point of view this is either a great idea or very bad.  I
> haven't decided yet :)

Yeah, but they've got some pretty big people in the field backing it up!
 
:>

Chris

> Sam
> 
> 
> --------------------------------------------------------------------
> Sam Griffiths-Jones                              sgj at sanger.ac.uk
> http://www.sanger.ac.uk/Users/sgj                +44 (0)1223 834244
> 
> Wisdom #8002: Always try to do things in chronological order;
> it's less confusing that way.
> --------------------------------------------------------------------
-- 
Christopher Fields
Lab of Dr. Robert Switzer
Dept. of Biochemistry
University of Illinois at Urbana-Champaign


