[Bioperl-l] RNA fold

Tue Dec 9 10:17:44 EST 2003

On Tue, 9 Dec 2003, Stephen Baird wrote:

> Dear hardworking guys,
>    Sorry....but I am a little worried about the <<...>>>> format having
> trouble with pseudoknots and non-canonical base pairing...something that
> happens more often than is apparent by programs that predict RNA
> basepairing based on thermodynamics like MFOLD and the like.   RNAmotif,
> which does pattern searching, can accommodate all the weird things that
> might happen in an RNA structure; it is up to the user who designs the
> pattern.
>    Mapping a simple < or > or . to each nucleotide might not be
> enough to work all the time.  Is there a way to store, for each base, the
> specific nucleotide that it is basepairing to in a structural field? This
> would allow non-canonical basepairing and pseudoknots.
> There is a new RNA structure XML file format which is supposed to be a new
> standard...RNAML http://www-lbit.iro.umontreal.ca/rnaml/.... which  will
> store the secondary  and tertiary structural data.  As RNA prediction and
> analysis develops  more and more data will need to be added that is not
> just the basepairing of canonical bases.
>
Good point.
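
A minimal sketch of the per-base partner idea raised above, written in plain
Python so that nothing here is mistaken for an existing Bioperl object. It
assumes an extended bracket notation in which each bracket family ((), <>, [],
{}) is matched on its own stack, which lets pairs from different families
cross, as they do in a pseudoknot. The name partner_table is hypothetical. A
partner index records only who pairs with whom, so a non-canonical pair is
stored exactly like a canonical one.

```python
def partner_table(structure):
    """Return a list whose entry i is the 0-based index paired with i, or None."""
    openers = {"(": ")", "<": ">", "[": "]", "{": "}"}
    closers = {close: open_ for open_, close in openers.items()}
    # One stack per bracket family, so different families may cross each other.
    stacks = {open_: [] for open_ in openers}
    partners = [None] * len(structure)

    for i, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(i)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {i}")
            j = stack.pop()
            partners[i], partners[j] = j, i
        # Any other character (for example '.') is treated as unpaired.

    for open_, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {open_!r} at position {stack[-1]}")
    return partners


# Illustrative pseudoknot: the [] pairs cross the () pairs.
print(partner_table("((..[[..))..]]"))
```

A table like this can be serialised as a list of i:j pairs in a single tag
value, at the cost of no longer being readable at a glance the way a bracket
string is.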

Had heard that an XML format was on the way - this seems a more intelligent
system for storage without information loss - but of course it won't fit
into the simple GFF system that Chris was thinking about.  Probably means
Chris would want to use GFF to store the representation of the
genomic location of the RNAs but a separate CGI type script will do all
the heavy lifting of getting an ID, looking up the structure
representation, and generating the plots/summary info/etc.

We really have no objects for RNA structure in Bioperl at this point so
pretty much a blank slate for someone to exert their will...

I would much rather see us move up the sophistication ladder here, but
someone new has to be willing to take it on as a project.

The aforementioned hardworking guys will do our best to help in any way
possible with design/programming issues but can't drive this beast.

-jason
>
> Stephen Baird
> Molecular Genetics
> Children's Hospital of Eastern Ontario
> Ottawa, Ontario
> Canada
>
> > On Mon, 2003-12-08 at 12:06, Jason Stajich wrote:
> > > On Sat, 6 Dec 2003, Chris Fields wrote:
> > >
> > > > I think, like the rest, that RNAFold may be the easiest way to go.
> > > > mfold is a free program but distribution is bound up by licensing
> > > > issues (I have it but can't redistribute it due to this; the web
> > > > interfaces available have some limitations which I couldn't do
> > > > without).  RNAFold doesn't have these problems and the source code is
> > > > available on the web, plus (like Jason pointed out) there are perl
> > > > interfaces.  There is also something in the book Genomic Perl on
> > > > calculating energies and drawing secondary structures, but I haven't
> > > > checked it out in detail.
> > > >
> > > > Personally, I am working on a bioperl parser for the RNAmotif program
> > > > suite (used to search for conserved secondary structures based on a
> > > > descriptor).  The rnamotif program is able to pass the motif hits to
> > > > efn or efn2 for calculating free energy (based on different energy
> > > > rules) and can output CT format files.  I'm also thinking about doing
> > > > something similar for tRNAscan-SE and ERPIN at some point.  The problem
> > > > I'm running into is how to store the secondary structure output for
> > > > inclusion into GFF databases (I'm currently using
> > > > Bio::SeqFeature::Generic for storing features).  Anyone?
> > >
> > > Chris - I assume the structure is represented as string like
> > > <<<<...>>>> or ((((...)))) ?
> > > If you do
> > > $feat->add_tag_value('secondary_structure',$str);
> > >
> > > This should store okay in a DB::GFF db or is that not really working for
> > > you?
> >
> > I think that would work.  I will have to do some fiddling with the
> > program output to get it into that format.  One problem is that RNAmotif
> > allows mismatches in some of the segments.
> >
> > RNAmotif's raw output is a bit like FASTA.  Here's a bit from one of my
> > analyses (the PyrR mRNA-binding site in Bacillus subtilis, run from the
> > Genbank file):
> >
> > #RM scored
> > #RM descr h5(tag='H1') ss(tag='S1') h5(tag='H2') h5(tag='H2t')
> > ss(tag='S2') h3(tag='H2t') h3(tag='H2') ss(tag='S3') h3(tag='H1')
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -6.300 0 1617567   35 attctt taaaa
> > cagt c cagaga g gctg ag aaggat
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -8.000 0 1617567   35 attcttt aaaa
> > cagt c cagaga g gctg a gaaggat
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -5.200 0 1617568   33 ttctt taaaa
> > cagt c cagaga g gctg ag aagga
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -6.900 0 1617568   33 ttcttt aaaa
> > cagt c cagaga g gctg a gaagga
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -0.400 0 1617568   32 ttcttt aaaa
> > cagt c cagaga g gctg . agaagg
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -7.200 0 1617569   32 tcttt aaaa
> > cagt c cagaga g gctg ag aagga
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -3.900 0 1617569   31 tctt taaaa
> > cagt c cagaga g gctg ag aagg
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -5.600 0 1617569   31 tcttt aaaa
> > cagt c cagaga g gctg a gaagg
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -4.800 0 1617570   30 cttt aaaa
> > cagt c cagaga g gctg ag aagg
> > >gi|16077068|gb|NC_000964|NC_000964 DEFINITION  Bacillus subtilis,
> > complete genome.
> > gi|16077068|gb|NC_000964|NC_000964   -4.100 0 1617570   29 cttt aaaa
> > cagt c cagaga g gctg a gaag
> >
> > ....
> >
> >
> > The first two lines (marked with #RM) are the initialization line and a
> > bit from the descriptor file (describing the secondary structural
> > characteristics).  The different segments of the structure are given a
> > designation (ss=single stranded, etc) and a tag (any name, although I
> > use simple ones).  The tags help when describing more complex structures
> > by allowing for pairing between distant sites and higher level
> > interactions (pseudoknots and tertiary and quaternary structures,
> > although I haven't needed these).  The output is like fasta, but the
> > sequence data is replaced by the database hit (usually the acc. #),
> > score (in this case, free energy), strand of hit, start of hit, length
> > of hit and the sequence itself, broken up into segments matching the
> > elements in the descriptor.  This is where the trouble lies; as RNAmotif
> > allows for mismatches in the descriptor (to allow for internal bulges),
> > the parser for the sequence elements will need to be intelligent enough
> > to pick this out.
> >
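
A rough sketch of the parsing step described above, in plain Python rather
than Bioperl. The field order (name, score, strand, start, length, segments)
follows the description in the message. Two things are assumptions: a lone "."
is read as an empty segment (the sample lengths only add up that way), and
only the h5, h3 and ss element types seen in this descriptor are handled. All
names (Hit, parse_descriptor, parse_hit, naive_brackets) are hypothetical, and
lines wrapped by the mailer must be rejoined before parsing.

```python
import re
from dataclasses import dataclass

# Matches descriptor elements such as h5(tag='H1').
ELEMENT_RE = re.compile(r"(\w+)\(tag='([^']*)'\)")

# Assumption: only these three element types occur in the descriptor.
SYMBOLS = {"h5": "(", "h3": ")", "ss": "."}


@dataclass
class Hit:
    name: str
    score: float
    strand: int
    start: int
    length: int
    segments: list


def parse_descriptor(line):
    """Return [(element_type, tag), ...] from a '#RM descr' line."""
    return ELEMENT_RE.findall(line)


def parse_hit(line):
    """Split one hit line into its fields and per-element segments."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected name, score, strand, start, length, segments")
    # Assumption: a lone '.' stands for an empty segment.
    segments = ["" if seg == "." else seg for seg in fields[5:]]
    hit = Hit(fields[0], float(fields[1]), int(fields[2]),
              int(fields[3]), int(fields[4]), segments)
    if sum(len(seg) for seg in segments) != hit.length:
        raise ValueError("segment lengths do not add up to the hit length")
    return hit


def naive_brackets(elements, hit):
    """Write every helix position as paired, mismatches included."""
    if len(elements) != len(hit.segments):
        raise ValueError("descriptor and hit have different element counts")
    return "".join(SYMBOLS[kind] * len(seg)
                   for (kind, _tag), seg in zip(elements, hit.segments))


DESCR = ("#RM descr h5(tag='H1') ss(tag='S1') h5(tag='H2') h5(tag='H2t') "
         "ss(tag='S2') h3(tag='H2t') h3(tag='H2') ss(tag='S3') h3(tag='H1')")
HIT_LINE = ("gi|16077068|gb|NC_000964|NC_000964   -6.300 0 1617567   35 "
            "attctt taaaa cagt c cagaga g gctg ag aaggat")

hit = parse_hit(HIT_LINE)
print(hit.start, hit.score, naive_brackets(parse_descriptor(DESCR), hit))
```

The bracket string produced this way cannot tell a true pair from a tolerated
mismatch inside a helix, which is the information loss under discussion;
keeping the segments alongside the string avoids throwing that away.
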
> > Also note that the data hits are redundant (they are retained b/c they
> > fall below a predetermined threshold from the calculated free energy,
> > determined in the descriptor file).  I plan on including a parser to
> > clean this up (retain the best score of a fold located within a certain
> > sequence range, probably less than 10 bp).  There's a program in the
> > RNAmotif suite to do this (rmprune), but it doesn't always "prune" to
> > the best sequence hit.
> >
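
One way the clean-up step could look, sketched in plain Python. This is not
what rmprune does. It assumes that a lower score (free energy) is better, that
hits are grouped only when they share a sequence name and strand, and that the
window is measured from the first start position in each group. The function
name prune_hits and the tuple layout are hypothetical.

```python
def prune_hits(hits, window=10):
    """Keep the lowest-scoring hit of each cluster of nearby start positions.

    hits: iterable of (name, strand, start, score) tuples.
    """
    kept = []
    cluster_key = None     # (name, strand) of the current cluster
    cluster_start = None   # start position of the first hit in the cluster
    best = None            # best hit seen so far in the current cluster

    for hit in sorted(hits, key=lambda h: (h[0], h[1], h[2])):
        name, strand, start, score = hit
        in_cluster = (best is not None
                      and (name, strand) == cluster_key
                      and start - cluster_start < window)
        if in_cluster:
            if score < best[3]:
                best = hit
        else:
            if best is not None:
                kept.append(best)
            cluster_key, cluster_start, best = (name, strand), start, hit

    if best is not None:
        kept.append(best)
    return kept


# A few of the hits quoted above, plus one made-up distant hit for contrast.
sample = [
    ("NC_000964", 0, 1617567, -6.3),
    ("NC_000964", 0, 1617567, -8.0),
    ("NC_000964", 0, 1617568, -5.2),
    ("NC_000964", 0, 1617570, -4.1),
    ("NC_000964", 0, 1700000, -3.0),  # hypothetical, not from the output
]
print(prune_hits(sample))
```
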
> > > There are some newish bioperl objects Seq::Meta which are for representing
> > > some bit of information about each base - maybe this is the place RNA or
> > > Protein secondary structure information can be coded.
> > > I'm not sure of what is best way to store these data - Heikki and others
> > > have mostly worked on them so I can only hand wave at this point.
> > >
> > >
> > > I'm not sure what type of computing you want to do on the data; what you
> > > want to do might dictate creating/using different objects.
> > > i.e. if you wanted to get the residues of the stems I think you might want
> > > to build a special object which can represent the pairing after parsing it
> > > out of the structure string.
> >
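
A small sketch of the "residues of the stems" idea, in plain Python and not
tied to any Bioperl class. It assumes a pseudoknot-free structure string that
uses one bracket family (<> is translated to () first) and defines a stem as a
run of directly stacked pairs. The function name stems and the returned tuple
layout are hypothetical.

```python
def stems(sequence, structure):
    """Group stacked base pairs into stems.

    Returns a list of (five_prime, three_prime) tuples, where each side is
    (start, end, residues) with 0-based inclusive coordinates.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure differ in length")
    structure = structure.translate(str.maketrans("<>", "()"))

    # Collect base pairs with a single stack (no pseudoknots supported).
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")

    # A pair (i, j) extends the current stem if the previous pair is (i-1, j+1).
    groups = []
    for i, j in sorted(pairs):
        if groups and groups[-1][-1] == (i - 1, j + 1):
            groups[-1].append((i, j))
        else:
            groups.append([(i, j)])

    result = []
    for group in groups:
        (i0, j0), (i1, j1) = group[0], group[-1]
        result.append(((i0, i1, sequence[i0:i1 + 1]),
                       (j1, j0, sequence[j1:j0 + 1])))
    return result


# First PyrR hit quoted above, with every helix position written as paired.
seq = "attctttaaaacagtccagagaggctgagaaggat"
struct = "((((((.....(((((......)))))..))))))"
for five_prime, three_prime in stems(seq, struct):
    print(five_prime, three_prime)
```
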
> > My main use for this is to map these database hits against the sequence
> > using Gbrowse.  I would like to add a Gbrowse plugin to link to some
> > sort of secondary structure output, maybe from the Vienna package to
> > represent the secondary structure (if using the parenthetical
> > notation).  I can also get CT format output from another program in the
> > RNAmotif suite (rm2ct), so changing formats shouldn't be too hard but
> > does require passing the output file through rm2ct.  My main concern is
> > getting the data into some format that could retain structural
> > information that would prevent informational loss.
> >
> > > -jason
> > >
> > > >
> > > > Chris Fields
> > > > Postdoctoral Researcher - Dept. of Biochemistry
> > > > University of Illinois at Urbana-Champaign
> > > >
> > > > On Dec 5, 2003, at 2:22 PM, Vesko Baev wrote:
> > > >
> > > > > Hi to all,
> > > > > does anyone know of a module or external program (which can be linked to
> > > > > bioperl) for folding RNA, predicting hairpins, and calculating free
> > > > > energy?
> > > > >
> > > > > Thanks to ALL!
> > > > >
> > > > > Vesselin Baev
> > > > > Bulgaria
> > > > >
> > > > > _______________________________________________
> > > > > Bioperl-l mailing list
> > > > > Bioperl-l at portal.open-bio.org
> > > > > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > > >
> > >
> > > --
> > > Jason Stajich
> > > Duke University
> > > jason at cgt.mc.duke.edu
> > >
> > --
> > Christopher Fields
> > Lab of Dr. Robert Switzer
> > Dept. of Biochemistry
> > University of Illinois at Urbana-Champaign
> >
> >
>
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu