[Bioperl-l] RNA fold

Tue Dec 9 12:00:54 EST 2003

On Tue, 2003-12-09 at 09:17, Jason Stajich wrote:
> On Tue, 9 Dec 2003, Stephen Baird wrote:
> 
> > Dear hardworking guys,
> >    Sorry... but I am a little worried that the <<<<...>>>> format will
> > have trouble with pseudoknots and non-canonical base pairing, which
> > happen more often than is apparent from programs such as MFOLD that
> > predict RNA base pairing from thermodynamics.  RNAmotif, which does
> > pattern searching, can accommodate all the weird things that might
> > happen in an RNA structure; it is up to the user who designs the
> > pattern.
> >    Mapping a simple < or > or . to each nucleotide might not be
> > enough to work all the time.  Is there a way to store, for each base,
> > the specific nucleotide that it is base pairing with in a structural
> > field?  This would allow non-canonical base pairing and pseudoknots.
> > There is a new RNA structure XML file format which is supposed to be a
> > new standard, RNAML (http://www-lbit.iro.umontreal.ca/rnaml/), which
> > will store the secondary and tertiary structural data.  As RNA
> > prediction and analysis develop, more and more data will need to be
> > added beyond the base pairing of canonical bases.
> >
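A rough sketch of the per-base partner idea described above, written in Python purely for illustration. The function names `pairs_to_partners` and `has_pseudoknot` are made up here; they are not part of Bioperl or RNAML. Each base stores the index of its partner (or nothing), so base identity does not matter and crossing stems are allowed.

```python
def pairs_to_partners(length, pairs):
    """Return a list where entry i holds the 0-based partner of base i, or None."""
    partners = [None] * length
    for i, j in pairs:
        if partners[i] is not None or partners[j] is not None:
            raise ValueError(f"base {i} or {j} is already paired")
        partners[i], partners[j] = j, i
    return partners


def has_pseudoknot(partners):
    """True if any two pairs (i, j) and (k, l) cross, i.e. i < k < j < l."""
    pairs = sorted((i, j) for i, j in enumerate(partners)
                   if j is not None and i < j)
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[a + 1:]:
            if i < k < j < l:
                return True
    return False


# A nested hairpin, equivalent to ((...))
hairpin = pairs_to_partners(7, [(0, 6), (1, 5)])
# Toy example with crossing stems, which one bracket type cannot express
knot = pairs_to_partners(10, [(0, 5), (1, 4), (3, 9)])
```

With this representation `has_pseudoknot(hairpin)` is false and `has_pseudoknot(knot)` is true, while a string of only `<`, `>` and `.` has no way to say which closing symbol belongs to which opening one once stems cross.
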
> Good point.
> 
> I had heard that an XML format was on the way.  It seems a more
> intelligent system for storage without information loss, but of course it
> won't fit into the simple GFF system that Chris was thinking about.  That
> probably means Chris would want to use GFF to store the genomic location
> of the RNAs, while a separate CGI-type script does all the heavy lifting
> of getting an ID, looking up the structure representation, and generating
> the plots, summary info, etc.

Aha!  This seems like a good idea!  Maybe use the tag to store a
database location (ID), then use the CGI script to pull it out, set up
the plot, etc.  Nice, and it shouldn't be too hard (although I could be
kicking myself later for saying that...)
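
To make the ID-in-a-tag idea concrete, here is a minimal sketch in Python. Everything named here is an assumption for illustration: the tag name `structure_id`, the ID `RM0001`, the toy structure string, and the in-memory `STRUCTURES` dictionary standing in for whatever database or RNAML files would really hold the structures. The attribute column follows the GFF2 convention of `tag "value"` fields separated by semicolons.

```python
# Hypothetical store mapping a structure ID to its full record; in practice
# this might be a database table or a directory of RNAML files.
STRUCTURES = {
    "RM0001": {"notation": "((((....))))"},  # toy value, not real data
}


def make_attributes(tags):
    """Format tag/value pairs in the GFF2 style: tag "value" ; tag "value"."""
    return " ; ".join(f'{tag} "{value}"' for tag, value in tags.items())


def parse_attributes(column):
    """Parse a GFF2-style attribute column back into a dict."""
    tags = {}
    for field in column.split(";"):
        field = field.strip()
        if not field:
            continue
        tag, _, value = field.partition(" ")
        tags[tag] = value.strip().strip('"')
    return tags


def lookup_structure(column, store=STRUCTURES):
    """Return the stored structure record for a feature, or None if absent."""
    structure_id = parse_attributes(column).get("structure_id")
    return store.get(structure_id)


column = make_attributes({"Note": "PyrR site", "structure_id": "RM0001"})
record = lookup_structure(column)
```

The feature itself then carries only the genomic location and a short ID, and the script that draws the plot is the only piece that needs to understand the structure format. Note that this simple parser would break on values containing a semicolon.
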

> We really have no objects for RNA structure in Bioperl at this point, so
> it's pretty much a blank slate for someone to exert their will...

I think RNAML is the way to go (as I told Stephen previously).  It would
be nice to get an RNAML object going...maybe Bio::SeqFeature::RNAML? 
Bio::Tools::RNAML?  Bio::Tools::Run::rnatools::RNAML? (that's a
mouthful....)

> I would much rather see us move up the sophistication ladder here, but
> someone new has to be willing to take it on as a project.
> 
> The aforementioned hardworking guys will do our best to help in any way
> possible with design/programming issues but can't drive this beast.

I have to admit that I'm still somewhat of a newbie, though I have
picked up quite a bit from reading and, of course, using the Camel and
Llama books (plus Conway's OO Perl and Schwartz's Learning Perl
Objects, References & Modules).  I'm an RNA researcher at heart and have
been programming for ~1 year off and on, mainly out of an interest in
Perl but also for research as a postdoc.  I would like to help out in
this area, but I am constrained by "wet-bench" research.  For my part
I'll definitely do what I can.

On the plus side, I would be able to test on three different platforms
(Mac OS X, Fedora Core 1 Linux, and Windows XP)! 

I'll read up on RNAML to see what can be done.  I'll also look at the
Bio::Tools::Run::PiseApplication::mfold in bioperl-run and the perl
scripts in the Vienna package to see how output is processed for those
programs.

Chris 

> -jason
> > > > > issues (I have it but can't redistribute it due to this; the web
> > > > > interfaces available have some limitations which I couldn't do
> > > > > without).  RNAFold doesn't have these problems and the source code is
> > > > > available on the web, plus (like Jason pointed out) there are perl
> > > > > interfaces.  There is also something in the book Genomic Perl on
> > > > > calculating energies and drawing secondary structures, but I haven't
> > > > > checked it out in detail.
> > > > >
> > > > > Personally, I am working on a bioperl parser for the RNAmotif program
> > > > > suite (used to search for conserved secondary structures based on a
> > > > > descriptor).  The rnamotif program is able to pass the motif hits to
> > > > > efn or efn2 for calculating free energy (based on different energy
> > > > > rules) and can output CT format files.  I'm also thinking about doing
> > > > > something similar for tRNAscan-SE and ERPIN at some point.  The problem
> > > > > I'm running into is how to store the secondary structure output for
> > > > > inclusion into GFF databases (I'm currently using
> > > > > Bio::SeqFeature::Generic for storing features).  Anyone?
> > > >
> > > > Chris - I assume the structure is represented as a string like
> > > > <<<<...>>>> or ((((...)))) ?
> > > > If you do
> > > > $feat->add_tag_value('secondary_structure',$str);
> > > >
> > > > This should store okay in a DB::GFF db or is that not really working for
> > > > you?
> > >
> > > I think that would work.  I will have to do some fiddling with the
> > > program output to get it into that format.  One problem is that RNAmotif
> > > allows mismatches in some of the segments.
> > >
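One possible way to do that fiddling, sketched in Python. This is an assumption-laden illustration, not RNAmotif's or Bioperl's own method: `segments_to_brackets` is a made-up name, only `h5`, `h3` and `ss` elements are handled, both strands of a helix are assumed to have equal length, and a segment written as `.` is assumed to be empty. Helix positions that do not form a Watson-Crick or G-U pair are written as unpaired, which is one way to cope with allowed mismatches.

```python
# Watson-Crick and G-U wobble pairs, written for DNA or RNA letters.
CANONICAL = {("a", "t"), ("t", "a"), ("a", "u"), ("u", "a"),
             ("g", "c"), ("c", "g"), ("g", "t"), ("t", "g"),
             ("g", "u"), ("u", "g")}


def segments_to_brackets(elements):
    """elements: list of (kind, tag, sequence) with kind in {'h5', 'h3', 'ss'}.

    Returns a dot-bracket string. Helix positions that do not form a
    canonical or wobble pair are written as '.' rather than as brackets.
    """
    structure = []
    open_helices = {}  # tag -> (offset of the 5' strand, its sequence)
    for kind, tag, seq in elements:
        seq = "" if seq == "." else seq.lower()
        offset = len(structure)
        if kind == "ss":
            structure.extend("." * len(seq))
        elif kind == "h5":
            open_helices[tag] = (offset, seq)
            structure.extend("." * len(seq))  # filled in when h3 arrives
        elif kind == "h3":
            start5, seq5 = open_helices.pop(tag)
            if len(seq5) != len(seq):
                raise ValueError(f"helix {tag}: strand lengths differ")
            structure.extend("." * len(seq))
            for k, base3 in enumerate(seq):
                # 3' strand position k pairs with 5' strand position n-1-k
                partner = len(seq5) - 1 - k
                if (seq5[partner], base3) in CANONICAL:
                    structure[start5 + partner] = "("
                    structure[offset + k] = ")"
        else:
            raise ValueError(f"unknown element kind: {kind}")
    return "".join(structure)
```

The resulting string could then be passed to `add_tag_value` as suggested above. Pseudoknot or tertiary elements in a descriptor would need a richer representation than this.
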
> > > RNAmotif's raw output is a bit like FASTA.  Here's a bit from one of my
> > > analyses (the PyrR mRNA-binding site in Bacillus subtilis, run from the
> > > Genbank file):
> > >
> > > #RM scored
> > > #RM descr h5(tag='H1') ss(tag='S1') h5(tag='H2') h5(tag='H2t')
> > > ss(tag='S2') h3(tag='H2t') h3(tag='H2') ss(tag='S3') h3(tag='H1')
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -6.300 0 1617567   35 attctt taaaa
> > > cagt c cagaga g gctg ag aaggat
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -8.000 0 1617567   35 attcttt aaaa
> > > cagt c cagaga g gctg a gaaggat
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -5.200 0 1617568   33 ttctt taaaa
> > > cagt c cagaga g gctg ag aagga
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -6.900 0 1617568   33 ttcttt aaaa
> > > cagt c cagaga g gctg a gaagga
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -0.400 0 1617568   32 ttcttt aaaa
> > > cagt c cagaga g gctg . agaagg
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -7.200 0 1617569   32 tcttt aaaa
> > > cagt c cagaga g gctg ag aagga
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -3.900 0 1617569   31 tctt taaaa
> > > cagt c cagaga g gctg ag aagg
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -5.600 0 1617569   31 tcttt aaaa
> > > cagt c cagaga g gctg a gaagg
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -4.800 0 1617570   30 cttt aaaa
> > > cagt c cagaga g gctg ag aagg
> > > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > > complete genome.
> > > gi|16077068|gb|NC_000964|NC_000964   -4.100 0 1617570   29 cttt aaaa
> > > cagt c cagaga g gctg a gaag
> > >
> > > ....
> > >
> > >
> > > The first two lines (marked with #RM) are the initialization line and a
> > > bit from the descriptor file (describing the secondary structural
> > > characteristics).  The different segments of the structure are given a
> > > designation (ss=single stranded, etc) and a tag (any name, although I
> > > use simple ones).  The tags help when describing more complex structures
> > > by allowing for pairing between distant sites and higher level
> > > interactions (pseudoknots and tertiary and quaternary structures,
> > > although I haven't needed these).  The output is like FASTA, but the
> > > sequence line is replaced by the database hit (usually the acc. #),
> > > score (in this case, free energy), strand of hit, start of hit, length
> > > of hit, and the sequence itself, broken up into segments matching the
> > > elements in the descriptor.  This is where the trouble lies: because
> > > RNAmotif allows for mismatches in the descriptor (to allow for internal
> > > bulges), the parser for the sequence elements will need to be
> > > intelligent enough to pick these out.
> > >
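A minimal Python sketch of reading the fields just described. It assumes each hit has been rejoined onto a single line (the mailer wrapped them above), that a segment printed as `.` is empty, and that the field order is exactly the one listed: name, score, strand, start, length, then one segment per descriptor element. `parse_descriptor` and `parse_hit` are illustrative names only.

```python
import re


def parse_descriptor(line):
    """Extract (kind, tag) pairs from a '#RM descr ...' line."""
    return re.findall(r"(\w+)\(tag='([^']*)'\)", line)


def parse_hit(line, descriptor):
    """Split one unwrapped hit line into named fields and tagged segments."""
    fields = line.split()
    name, score, strand, start, length = fields[:5]
    segments = fields[5:]
    if len(segments) != len(descriptor):
        raise ValueError("segment count does not match descriptor")
    return {
        "name": name,
        "score": float(score),
        "strand": int(strand),
        "start": int(start),
        "length": int(length),
        # pair every segment with the descriptor element it matched
        "elements": [(kind, tag, "" if seg == "." else seg)
                     for (kind, tag), seg in zip(descriptor, segments)],
    }


descr = ("#RM descr h5(tag='H1') ss(tag='S1') h5(tag='H2') h5(tag='H2t') "
         "ss(tag='S2') h3(tag='H2t') h3(tag='H2') ss(tag='S3') h3(tag='H1')")
hit = parse_hit(
    "gi|16077068|gb|NC_000964|NC_000964   -0.400 0 1617568   32 "
    "ttcttt aaaa cagt c cagaga g gctg . agaagg",
    parse_descriptor(descr),
)
```

A quick sanity check is that the segment lengths add up to the reported hit length, which holds for this record only if `.` counts as zero bases. Spotting mismatches inside a helix still needs a base-by-base comparison of the two strands.
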
> > > Also note that the data hits are redundant (they are retained b/c they
> > > fall below a predetermined free-energy threshold set in the descriptor
> > > file).  I plan on including a parser to clean this up (retain the
> > > best-scoring fold located within a certain sequence range, probably less
> > > than 10 bp).  There's a program in the RNAmotif suite to do this
> > > (rmprune), but it doesn't always "prune" to the best sequence hit.
> > >
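The clean-up step could look roughly like this in Python. This is only a sketch of one simple clustering rule, not what rmprune does: hits are sorted by start, a hit joins the current cluster when its start lies within `window` bases of the first start in that cluster, and the lowest free energy in each cluster wins. The name `prune_hits` and the tuple layout are assumptions.

```python
def prune_hits(hits, window=10):
    """Keep the lowest-energy hit from each cluster of nearby starts.

    hits: iterable of (start, length, score) tuples.
    A hit joins the current cluster when its start lies within `window`
    bases of the first start in that cluster.
    """
    best = []
    cluster_start = None
    for hit in sorted(hits):
        start, _length, score = hit
        if cluster_start is None or start - cluster_start >= window:
            # far enough from the current cluster: start a new one
            cluster_start = start
            best.append(hit)
        elif score < best[-1][2]:
            # same cluster, but a more favourable (lower) free energy
            best[-1] = hit
    return best


# (start, length, score) for the ten hits listed above
hits = [
    (1617567, 35, -6.3), (1617567, 35, -8.0), (1617568, 33, -5.2),
    (1617568, 33, -6.9), (1617568, 32, -0.4), (1617569, 32, -7.2),
    (1617569, 31, -3.9), (1617569, 31, -5.6), (1617570, 30, -4.8),
    (1617570, 29, -4.1),
]
kept = prune_hits(hits)
```

Anchoring the window on the first start of a cluster keeps a long run of overlapping hits from chaining into one ever-growing cluster; anchoring on the previous hit instead would be a different, equally defensible choice.
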
> > > > There are some newish bioperl objects, Seq::Meta, which are for
> > > > representing some bit of information about each base - maybe this is
> > > > the place where RNA or protein secondary structure information can be
> > > > coded.  I'm not sure what the best way to store these data is - Heikki
> > > > and others have mostly worked on them, so I can only hand wave at this
> > > > point.
> > > >
> > > >
> > > > What type of computing you want to do on the data might dictate
> > > > creating/using different objects, e.g. if you wanted to get the
> > > > residues of the stems I think you might want to build a special object
> > > > which can represent the pairing after parsing it out of the structure
> > > > string.
> > >
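Parsing the pairing out of a structure string is a standard stack exercise; a small Python sketch follows. The function names are invented for illustration and nothing here is an existing Bioperl object. Both `()` and `<>` notation are accepted, and indices are 0-based.

```python
OPENERS = {"(", "<"}
CLOSERS = {")": "(", ">": "<"}


def parse_brackets(structure):
    """Return sorted (i, j) 0-based index pairs from a bracket string."""
    stack = []
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol in OPENERS:
            stack.append((symbol, j))
        elif symbol in CLOSERS:
            # the most recent unmatched opener must be of the same type
            if not stack or stack[-1][0] != CLOSERS[symbol]:
                raise ValueError(f"unmatched {symbol!r} at position {j}")
            _, i = stack.pop()
            pairs.append((i, j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(
            f"unmatched {stack[-1][0]!r} at position {stack[-1][1]}")
    return sorted(pairs)


def stem_residues(sequence, structure):
    """List the paired residues as (i, base_i, j, base_j) tuples."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure differ in length")
    return [(i, sequence[i], j, sequence[j])
            for i, j in parse_brackets(structure)]
```

For example, `stem_residues("gggaaaccc", "(((...)))")` lists the three g-c pairs of the stem. Because the stack always matches the most recent opener, this only works for nested structures; crossing stems would need extra bracket types or an explicit pair list.
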
> > > My main use for this is to map these database hits against the sequence
> > > using Gbrowse.  I would like to add a Gbrowse plugin to link to some
> > > sort of secondary structure output, maybe from the Vienna package to
> > > represent the secondary structure (if using the parenthetical
> > > notation).  I can also get CT format output from another program in the
> > > RNAmotif suite (rm2ct), so changing formats shouldn't be too hard but
> > > does require passing the output file through rm2ct.  My main concern is
> > > getting the data into some format that retains the structural
> > > information without loss.
> > >
> > > > -jason
> > > >
> > > > >
> > > > > Chris Fields
> > > > > Postdoctoral Researcher - Dept. of Biochemistry
> > > > > University of Illinois at Urbana-Champaign
> > > > >
> > > > > On Dec 5, 2003, at 2:22 PM, Vesko Baev wrote:
> > > > >
> > > > > > Hi to all,
> > > > > > > does anyone know of a module or external program (which can be
> > > > > > > linked to bioperl) for folding an RNA, predicting hairpins, and
> > > > > > > calculating free energy?
> > > > > >
> > > > > > Thanks to ALL!
> > > > > >
> > > > > > Vesselin Baev
> > > > > > Bulgaria
> > > > > >
> > > > > > -----------------------------------------------------------------
> > > > > > http://www.pari.bg - ç¯§Ñ§Þ§Ú§Ñ§ Ý§ ç±§á§²  Ó§Ö§Ü§ Õ§Ö§ ?
> > > > > > ç¯§ Ñ§Ù§Ú§Ñ§Û§ ß§ Ü§Þ§Ö§!
> > > > > > ç¡§Ò§ß§Ú§Ñ§Û§ !
> > > > > > _______________________________________________
> > > > > > Bioperl-l mailing list
> > > > > > Bioperl-l at portal.open-bio.org
> > > > > > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > > > >
> > > >
> > > > --
> > > > Jason Stajich
> > > > Duke University
> > > > jason at cgt.mc.duke.edu
> > > --
> > > Christopher Fields
> > > Lab of Dr. Robert Switzer
> > > Dept. of Biochemistry
> > > University of Illinois at Urbana-Champaign
-- 
Christopher Fields
Lab of Dr. Robert Switzer
Dept. of Biochemistry
University of Illinois at Urbana-Champaign