[Bioperl-l] RNA fold

Mon Dec 8 16:12:20 EST 2003

On Mon, 2003-12-08 at 12:06, Jason Stajich wrote:
> On Sat, 6 Dec 2003, Chris Fields wrote:
> 
> > I think, like the rest, that RNAFold may be the easiest way to go.
> > mfold is a free program but distribution is bound up by licensing
> > issues (I have it but can't redistribute it due to this; the web
> > interfaces available have some limitations which I couldn't do
> > without).  RNAFold doesn't have these problems and the source code is
> > available on the web, plus (like Jason pointed out) there are perl
> > interfaces.  There is also something in the book Genomic Perl on
> > calculating energies and drawing secondary structures, but I haven't
> > checked it out in detail.
> >
> > Personally, I am working on a bioperl parser for the RNAmotif program
> > suite (used to search for conserved secondary structures based on a
> > descriptor).  The rnamotif program is able to pass the motif hits to
> > efn or efn2 for calculating free energy (based on different energy
> > rules) and can output CT format files.  I'm also thinking about doing
> > something similar for tRNAscan-SE and ERPIN at some point.  The problem
> > I'm running into is how to store the secondary structure output for
> > inclusion into GFF databases (I'm currently using
> > Bio::SeqFeature::Generic for storing features).  Anyone?
> 
> Chris - I assume the structure is represented as string like
> <<<...>>>> or ((((...)))) ?
> If you do
> $feat->add_tag_value('secondary_structure',$str);
> 
> This should store okay in a DB::GFF db or is that not really working for
> you?

I think that would work.  I will have to do some fiddling with the
program output to get it into that format.  One problem is taht RNAmotif
allows mismatches in some of the segments.  

RNAmotif's raw output is a bit like FASTA.  Here's a bit from one of my
analyses (the PyrR mRNA-binding site in Bacillus subtilis, rub from the
Genbank file):

#RM scored
#RM descr h5(tag='H1') ss(tag='S1') h5(tag='H2') h5(tag='H2t')
ss(tag='S2') h3(tag='H2t') h3(tag='H2') ss(tag='S3') h3(tag='H1')
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -6.300 0 1617567   35 attctt taaaa
cagt c cagaga g gctg ag aaggat
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -8.000 0 1617567   35 attcttt aaaa
cagt c cagaga g gctg a gaaggat
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -5.200 0 1617568   33 ttctt taaaa
cagt c cagaga g gctg ag aagga
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -6.900 0 1617568   33 ttcttt aaaa
cagt c cagaga g gctg a gaagga
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -0.400 0 1617568   32 ttcttt aaaa
cagt c cagaga g gctg . agaagg
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -7.200 0 1617569   32 tcttt aaaa
cagt c cagaga g gctg ag aagga
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -3.900 0 1617569   31 tctt taaaa
cagt c cagaga g gctg ag aagg
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -5.600 0 1617569   31 tcttt aaaa
cagt c cagaga g gctg a gaagg
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -4.800 0 1617570   30 cttt aaaa
cagt c cagaga g gctg ag aagg
>gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
complete genome.
gi|16077068|gb|NC_000964|NC_000964   -4.100 0 1617570   29 cttt aaaa
cagt c cagaga g gctg a gaag

....

The first two lines (marked with ##) are the initialization line and a
bit from the descriptor file (describing the secondary structural
characteristics).  The different segments of the structure are given a
designation (ss=single stranded, etc) and a tag (any name, although I
use simple ones).  The tags help when describing more complex structures
by allowing for pairing between distant sites and higher level
interactions (pseudoknots and tertiary and quaternary structures,
although I haven't needed these).  The output is like fasta, but the
sequence data is replaced by the database hit (usually the acc. #),
score (in this case, free energy), strand of hit, start of hit, length
of hit and the sequence itself, broken up into segments matching the
elements in the descriptor.  This is where the trouble lies; as RNAmotif
allows for mismatches in the descriptor (to allow for internal bulges),
the parser for the sequence elements will need to be intelligent enough
to pick this out.  

Also note that the data hits are redundant (they are retained b/c they
fall below a predetermined threshold from the calculated free energy,
determined in the descriptor file.  I plan on including a parser to
clean this up (retain the best score of a fold located within a certain
sequence range, probably less than 10 bp).  There's a program in the
RNAmotif suite to do this (rmprune), but it doesn't always "prune" to
the best sequence hit.

> There are some newish bioperl objects Seq::Meta which are for representing
> some bit of information about each base - maybe this is the place RNA or
> Protein secondary structure information can be coded.
> I'm not sure of what is best way to store these data - Heikki and others
> have mostly worked on them so I can only hand wave at this point.
> 
> 
> I'm not sure what type of computing you want to do on the data, depending
> on what you want to do, might dictate creating/using different objects.
> i.e. if you wanted to get the residues of the stems I think you might want
> to build a special object which can represent the pairing after parsing it
> out of the structure string.

My main use for this is to map these database hits against the sequence
using Gbrowse.  I would like to add a Gbrowse plugin to link to some
sort of secondary structure output, maybe from the Vienna package to
represent the secondary structure (if using the parenthetical
notation).  I can also get CT format output from another program in the
RNAmotif suite (rm2ct), so changing formats shouldn't be too hard but
does require passing the output file through rm2ct.  My main concern is
getting the data into some format that could retain structural
information that would prevent informational loss.

> -jason
> 
> >
> > Chris Fields
> > Postdoctoral Reseacher - Dept. of Biochemistry
> > University of Illinois at Urbana-Champaign
> >
> > On Dec 5, 2003, at 2:22 PM, Vesko Baev wrote:
> >
> > > Hi to all,
> > > if anyone knows a module or external program (which can be linked to
> > > bioperl) for folding a RNA predicting hairpins and calculating a free
> > > energy?
> > >
> > > Thanks to ALL!
> > >
> > > Vesselin Baev
> > > Bulgaria
> > >
> > > -----------------------------------------------------------------
> > > http://www.pari.bg - ç¯§Ñ§Þ§Ú§Ñ§ Ý§ ç±§á§²  Ó§Ö§Ü§ Õ§Ö§ ?
> > > ç¯§ Ñ§Ù§Ú§Ñ§Û§ ß§ Ü§Þ§Ö§!
> > > ç¡§Ò§ß§Ú§Ñ§Û§ !
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at portal.open-bio.org
> > > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
> 
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
-- 
Christopher Fields
Lab of Dr. Robert Switzer
Dept. of Biochemistry
University of Illinois at Urbana-Champaign