[Bioperl-l] RNA fold

Chris Fields cjfields at uiuc.edu
Tue Dec 9 11:29:14 EST 2003


I think that you can use parenthetical formats for pseudoknot-like
structures (improperly nested Watson-Crick helices).  The idea is that
() would represent secondary structure, and other brackets {}[] would
represent higher-order structures, like so:

     Helix         Pseudoknot           
______________       _______________
|            |       |             |
(((((....))))).[[[...((((..]]]..))))
                |___________|
                 

Of course this is where the problem lies, b/c all structures in this
format are constrained to simple 1:1 base associations, such as simple
Watson-Crick base pairs or noncanonical base pairs (A-G, G-U, etc). 
Some higher order structures, like triple-helices (A:U:U) and quaternary
helices (G:G:G:G) can't be accounted for.  Also, the parenthetical
syntax gets a bit confusing for very large sequences (16s rRNA, for
instance).  

I think that the format all really depends on the program and the
particular use.  Our motif is very simple (two consecutive helices
containing consensus sequences with purine-rich internal bulge and an
embedded GNRA tetraloop).  mfold and the Vienna package look for
secondary structure only with the constraint that all potential 3'
helices must be paired to the closest potential 5' helices, a rule which
pseudoknots break.  These programs do not search for a specific motif,
however; they mainly look for the most optimal folds for a sequence
under various conditions (salt cond., temp, etc) given the current
energy rules (Turner's rules). 

RNAmotif is a program which uses a descriptor of the sequence to search
for motifs (through a modified regex engine, or as T. Macke says, "a
pattern language which represents helices and structural motifs"), then
uses an arbitrary scoring system (based on Turner's free energy rules,
sequence comparisons, and other features) to "weed out" sequences that
don't constrain (awk-like, according to the manual).  In other words, it
is looking for a pattern which matches a hypothetical descriptor
element, then scores it afterwards.  Other programs (ERPIN, infernal)
use consensus searches built from alignments of known structural
motifs.  In other words, the motif is known ahead of time, likely
determined through structural studies.  

After all this babbling, I do think that RNAML is the way to go with
this.  However, when it comes to adding a tag for a SeqFeature object,
using RNAML becomes very problematic (not to mention an RNAML
parser/converter would need to be written!).  I'll think about it some
more.

Chris   

On Tue, 2003-12-09 at 08:45, Stephen Baird wrote:
> Dear hardworking guys,
>    Sorry....but I am a little worried about the <<...>>>> format having
> trouble with pseudoknots and non-canonical base pairing...something that
> happens more often than is apparent by programs that predict RNA
> basepairing based on thermodynamics like MFOLD and the like.   RNAmotifs
> which is doing  pattern searching can accomodate all the weird things that
> might happen in a RNA structure, it is up to the user who designs the
> pattern.
>    Mapping a simple < or > or . to each nucleotide might not be
> enough to work all the time.  Is there a way to store to a base the
> specific nucleotide that  it is basepairing to in a structural field? This
> would allow non-canonical basepairing and pseudoknots.
> There is a new RNA structure XML file format which is suppose to be a new
> standard...RNAML http://www-lbit.iro.umontreal.ca/rnaml/.... which  will
> store the secondary  and tertiary structural data.  As RNA prediction and
> analysis develops  more and more data will need to be added that is not
> just the basepairing of canonical bases.
> 
> 
> Stephen Baird
> Molecular Genetics
> Children's Hospital of Eastern Ontario
> Ottawa, Ontario
> Canada
> 
> > On Mon, 2003-12-08 at 12:06, Jason Stajich wrote:
> > > On Sat, 6 Dec 2003, Chris Fields wrote:
> > >
> > > > I think, like the rest, that RNAFold may be the easiest way to go.
> > > > mfold is a free program but distribution is bound up by licensing
> > > > issues (I have it but can't redistribute it due to this; the web
> > > > interfaces available have some limitations which I couldn't do
> > > > without).  RNAFold doesn't have these problems and the source code is
> > > > available on the web, plus (like Jason pointed out) there are perl
> > > > interfaces.  There is also something in the book Genomic Perl on
> > > > calculating energies and drawing secondary structures, but I haven't
> > > > checked it out in detail.
> > > >
> > > > Personally, I am working on a bioperl parser for the RNAmotif program
> > > > suite (used to search for conserved secondary structures based on a
> > > > descriptor).  The rnamotif program is able to pass the motif hits to
> > > > efn or efn2 for calculating free energy (based on different energy
> > > > rules) and can output CT format files.  I'm also thinking about doing
> > > > something similar for tRNAscan-SE and ERPIN at some point.  The problem
> > > > I'm running into is how to store the secondary structure output for
> > > > inclusion into GFF databases (I'm currently using
> > > > Bio::SeqFeature::Generic for storing features).  Anyone?
> > >
> > > Chris - I assume the structure is represented as string like
> > > <<<...>>>> or ((((...)))) ?
> > > If you do
> > > $feat->add_tag_value('secondary_structure',$str);
> > >
> > > This should store okay in a DB::GFF db or is that not really working for
> > > you?
> >
> > I think that would work.  I will have to do some fiddling with the
> > program output to get it into that format.  One problem is taht RNAmotif
> > allows mismatches in some of the segments.
> >
> > RNAmotif's raw output is a bit like FASTA.  Here's a bit from one of my
> > analyses (the PyrR mRNA-binding site in Bacillus subtilis, rub from the
> > Genbank file):
> >
> > #RM scored
> > #RM descr h5(tag='H1') ss(tag='S1') h5(tag='H2') h5(tag='H2t')
> > ss(tag='S2') h3(tag='H2t') h3(tag='H2') ss(tag='S3') h3(tag='H1')
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -6.300 0 1617567   35 attctt taaaa
> > cagt c cagaga g gctg ag aaggat
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -8.000 0 1617567   35 attcttt aaaa
> > cagt c cagaga g gctg a gaaggat
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -5.200 0 1617568   33 ttctt taaaa
> > cagt c cagaga g gctg ag aagga
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -6.900 0 1617568   33 ttcttt aaaa
> > cagt c cagaga g gctg a gaagga
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -0.400 0 1617568   32 ttcttt aaaa
> > cagt c cagaga g gctg . agaagg
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -7.200 0 1617569   32 tcttt aaaa
> > cagt c cagaga g gctg ag aagga
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -3.900 0 1617569   31 tctt taaaa
> > cagt c cagaga g gctg ag aagg
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -5.600 0 1617569   31 tcttt aaaa
> > cagt c cagaga g gctg a gaagg
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -4.800 0 1617570   30 cttt aaaa
> > cagt c cagaga g gctg ag aagg
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -4.100 0 1617570   29 cttt aaaa
> > cagt c cagaga g gctg a gaag
> >
> > ....
> >
> >
> > The first two lines (marked with ##) are the initialization line and a
> > bit from the descriptor file (describing the secondary structural
> > characteristics).  The different segments of the structure are given a
> > designation (ss=single stranded, etc) and a tag (any name, although I
> > use simple ones).  The tags help when describing more complex structures
> > by allowing for pairing between distant sites and higher level
> > interactions (pseudoknots and tertiary and quaternary structures,
> > although I haven't needed these).  The output is like fasta, but the
> > sequence data is replaced by the database hit (usually the acc. #),
> > score (in this case, free energy), strand of hit, start of hit, length
> > of hit and the sequence itself, broken up into segments matching the
> > elements in the descriptor.  This is where the trouble lies; as RNAmotif
> > allows for mismatches in the descriptor (to allow for internal bulges),
> > the parser for the sequence elements will need to be intelligent enough
> > to pick this out.
> >
> > Also note that the data hits are redundant (they are retained b/c they
> > fall below a predetermined threshold from the calculated free energy,
> > determined in the descriptor file.  I plan on including a parser to
> > clean this up (retain the best score of a fold located within a certain
> > sequence range, probably less than 10 bp).  There's a program in the
> > RNAmotif suite to do this (rmprune), but it doesn't always "prune" to
> > the best sequence hit.
> >
> > > There are some newish bioperl objects Seq::Meta which are for representing
> > > some bit of information about each base - maybe this is the place RNA or
> > > Protein secondary structure information can be coded.
> > > I'm not sure of what is best way to store these data - Heikki and others
> > > have mostly worked on them so I can only hand wave at this point.
> > >
> > >
> > > I'm not sure what type of computing you want to do on the data, depending
> > > on what you want to do, might dictate creating/using different objects.
> > > i.e. if you wanted to get the residues of the stems I think you might want
> > > to build a special object which can represent the pairing after parsing it
> > > out of the structure string.
> >
> > My main use for this is to map these database hits against the sequence
> > using Gbrowse.  I would like to add a Gbrowse plugin to link to some
> > sort of secondary structure output, maybe from the Vienna package to
> > represent the secondary structure (if using the parenthetical
> > notation).  I can also get CT format output from another program in the
> > RNAmotif suite (rm2ct), so changing formats shouldn't be too hard but
> > does require passing the output file through rm2ct.  My main concern is
> > getting the data into some format that could retain structural
> > information that would prevent informational loss.
> >
> > > -jason
> > >
> > > >
> > > > Chris Fields
> > > > Postdoctoral Reseacher - Dept. of Biochemistry
> > > > University of Illinois at Urbana-Champaign
> > > >
> > > > On Dec 5, 2003, at 2:22 PM, Vesko Baev wrote:
> > > >
> > > > > Hi to all,
> > > > > if anyone knows a module or external program (which can be linked to
> > > > > bioperl) for folding a RNA predicting hairpins and calculating a free
> > > > > energy?
> > > > >
> > > > > Thanks to ALL!
> > > > >
> > > > > Vesselin Baev
> > > > > Bulgaria
> > > > >
> > > > > -----------------------------------------------------------------
> > > > > http://www.pari.bg - 篧ѧާڧѧ ݧ 籧᧲  ӧ֧ܧ է֧ ?
> > > > > 篧 ѧ٧ڧѧۧ ߧ ܧާ֧!
> > > > > 硧ҧߧڧѧۧ !
> > > > > _______________________________________________
> > > > > Bioperl-l mailing list
> > > > > Bioperl-l at portal.open-bio.org
> > > > > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > > >
> > >
> > > --
> > > Jason Stajich
> > > Duke University
> > > jason at cgt.mc.duke.edu
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at portal.open-bio.org
> > > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > --
> > Christopher Fields
> > Lab of Dr. Robert Switzer
> > Dept. of Biochemistry
> > University of Illinois at Urbana-Champaign
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
> 
-- 
Christopher Fields
Lab of Dr. Robert Switzer
Dept. of Biochemistry
University of Illinois at Urbana-Champaign



More information about the Bioperl-l mailing list