[Bioperl-l] How to check mRNA moltype?

Heikki Lehvaslaiho heikki@ebi.ac.uk
Wed, 17 Oct 2001 15:36:17 +0100


Ewan Birney wrote:
> 
> On Wed, 17 Oct 2001, Heikki Lehvaslaiho wrote:
> 
> > Henry,
> >
> > moltype() is the correct method. It works on EMBL sequences so it should on
> > GenBank, too. I had a look at the code and RNA option was completely
> > ignored.
> > I rewrote the lines doing it and committed them to bioperl-live. I hope you
> > are using CVS and can see the changes immediately.
> >
> > This fix should go into 07 branch, too, but first I'd like to get feedback
> > that the fix really works on larger data sets. I tried it only on t/data
> > examples.
> > Does anyone have a system handy to try this? The potential problem is that
> > the DNA/mRNA/... keyword is allowed to be missing completely from the
> > LOCUS line.
> 
> I don't like this. Although genbank says "mRNA" it doesn't *use* RNA
> characters. I'd prefer the mRNA to be somewhere in the annotation whereas
> moltype means "alphabet".
> 
> It is a "where do you put this information" question. Any other opinions?

Ewan,

In bioperl we have stayed clear of forcing strict alphabets. Do you want to
change that completely? We might have to go that way, at least partially,
for CORBA compatibility.

So far the custom has been to use RNA/DNA alphabets interchangeably and simply
make sure that relevant methods know how to handle both. If someone wants to
translate a sequence with moltype 'dna', bioperl lets that happen. Isn't
that one of the guiding principles of perl: not to let formality stand in
the way of efficiency?
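
To illustrate what "methods know how to handle both" can mean in practice,
here is a minimal sketch in Python (not bioperl code). It assumes the standard
genetic code and a hypothetical helper named translate(); the only point is
that U is folded into T before the codon lookup, so the declared alphabet
never has to be checked.

```python
# Sketch only: a translation routine that is indifferent to the DNA/RNA
# alphabet. The function name and behaviour are assumptions for illustration,
# not the bioperl implementation.

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate a nucleotide string written with either T or U."""
    # Fold the RNA alphabet into the DNA one instead of rejecting it.
    normalized = seq.upper().replace("U", "T")
    protein = []
    # Walk complete codons only; a trailing partial codon is ignored.
    for pos in range(0, len(normalized) - 2, 3):
        codon = normalized[pos:pos + 3]
        # Codons with ambiguity symbols are reported as 'X'.
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCC"))  # DNA spelling
    print(translate("AUGGCC"))  # RNA spelling, same result
```

Both calls give the same peptide, which is the lenient behaviour described
above: the sequence is usable whatever its moltype says.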

If we want to keep all the information there is in the LOCUS line, there is
a lot more in there. To start with, these are all the options the new
standard allows for the source molecule ('topology of molecule sequenced'):

45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)
48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), 
           mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
           snoRNA. Left justified.
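
As a rough sketch of how these two fixed-width fields could be pulled out of
a LOCUS line, here is a short Python example. The names LocusMolecule,
parse_locus_molecule() and make_locus_line() are hypothetical, and the sample
lines are synthetic (built by padding), not real GenBank records. It assumes
the new-format column positions quoted above and returns None for a field
that is left blank, since the keyword may be missing completely.

```python
# Sketch only: fixed-column extraction of the strandedness and molecule
# type fields from a new-format LOCUS line. All names are illustrative.
from typing import NamedTuple, Optional


class LocusMolecule(NamedTuple):
    strandedness: Optional[str]  # 'ss-', 'ds-', 'ms-' or None if blank
    molecule: Optional[str]      # e.g. 'DNA', 'mRNA', or None if blank


def parse_locus_molecule(line: str) -> LocusMolecule:
    # Pad short lines so the slices below never run off the end.
    padded = line.rstrip("\n").ljust(53)
    # Columns 45-47 (1-based) are the 0-based slice [44:47].
    strand = padded[44:47].strip()
    # Columns 48-53 (1-based) are the 0-based slice [47:53].
    molecule = padded[47:53].strip()
    return LocusMolecule(strand or None, molecule or None)


def make_locus_line(strand: str = "", molecule: str = "") -> str:
    # Synthetic test line: only the two fields of interest are filled in,
    # everything before column 45 is padding after the LOCUS keyword.
    return "LOCUS".ljust(44) + strand.ljust(3) + molecule.ljust(6)


if __name__ == "__main__":
    print(parse_locus_molecule(make_locus_line("ss-", "mRNA")))
    print(parse_locus_molecule(make_locus_line("", "DNA")))
    print(parse_locus_molecule(make_locus_line()))
```

Whether the molecule field then ends up in moltype or in the annotation is
exactly the open question; the parsing itself is the same either way.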



EMBL docs state:

3.4.1 
Molecule Type: The third item on the line is the type of molecule as stored,
which at present can be either 'DNA', 'RNA' (see the comment in Section 2.1
about cDNA) or 'XXX' for unknown molecule type.

 ... and ...

2.1.
The sequences are presented in the database in a form corresponding to the
biological state of the information in vivo. Thus, cDNA sequences are stored
in the database as RNA sequences, even though they usually appear in the
literature as DNA.

http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-id+2ke6D1HP2PN+-e+[EMBL:'AB017977']
--

Actually, I do not think the above statement holds any more. Check any RNA
sequence and you will not find U characters.

I do not have time to check it, but I think the feature table source key holds
the information on the 'topology of molecule sequenced' in all sequence
databases (DDBJ/EMBL/GenBank).


	-Heikki


-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________