[Bioperl-l] SWISS-PROT writing

Kris Boulez krbou@pgsgent.be
Wed, 3 Jan 2001 08:29:43 +0100


Quoting Hilmar Lapp (lapp@gnf.org):
> Kris Boulez wrote:
> > 
> > 
> > - at line 356 there is
> >    $mol = $seq->molecule;
> > I think this should be $seq->moltype; as ->molecule only looks for
> > {'molecule'} which is not set by ->new. Bio::Seq->new only sets
> > {'moltype'}.
> > We should change the 'protein' of ->moltype to 'PRT' to conform to the
> > standard.
> 
> moltype() is internal to BioPerl. Whenever there is an attribute synonymous
> to moltype() but defined by a databank, molecule() should be used for that.
> So the code is correct I think.
> 
Then documentation for Bio::Seq->molecule() should be extended a bit. It
now reads

       molecule


        Title   : molecule
        Usage   : $obj->molecule($newval)
        Function:
        Returns : type of molecule (DNA, mRNA)
        Args    : newvalue (optional)


> Bio::Seq->new() indeed only sets moltype(), because at this point there is
> no databank specificity. molecule() should be set by the parser. If you
> want to instantiate a swissprot seq from memory and have it written in
> swissprot format, the way we want to go is have dedicated classes under
> Bio::Seq::*. If there is need for a swissprot-dedicated class, that one
> probably would also set molecule() at instantiation time.
> 
> > 
> > B.T.W. do we want to allow SWISS-PROT to try to write out DNA/RNA
> > sequences ?
> 
> In my opinion there's no need for that, but others may think differently.
> 
> > 
> > - around line 369 the whole else block should be changed. We should make
> >   sure we have a division ($div) in the ID part. The previous version of
> > the code which is now commented out did a better try at this. Looking at
> > next_seq() we why we're not able to read this (entry name must contain
> > an underscore section 3.1.1 of the SWISS-PROT manual).
> > 
> >     $line =~ /^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/
> >      || $self->throw("swissprot stream with no ID. Not swissprot in my
> > book");
> >    $name = $1."_".$2;
> >    $seq->primary_id($1);
> >    $seq->division($2);
> > 
> 
> If this is the code you're referring to (sorry, don't have at hand right
> now), it does ensure that there is a division part. I'm probably missing
> something.
> 
Sorry I wasn't clear on this one obviously. The code I pasted is from
next_seq(). What I was referring to is the code in write_seq(). In there
we do not enforce that there is a division part (I think we should at
least check if $seq->display_id() returns an underscore in a reasonable
position). The code reads

   } else {
       #$temp_line = sprintf ("%10s     STANDARD;      %3s;   %d AA.",
       #                     $seq->primary_id()."_".$div,$mol,$len);
       # Reconstructing the ID relies heavily upon the input source
       # having
       # been in a format that is parsed as this routine expects it --
       # that is,
       # by this module itself. This is bad, I think, and immediately
       # breaks
       # if e.g. the Bio::DB::GenPept module is used as input.
       # Hence, switch to display_id(); _every_ sequence is supposed to
       # have
       # this. HL 2000/09/03
       $temp_line = sprintf ("%10s     STANDARD;      %3s;   %d AA.",
                             $seq->display_id(), $mol, $len);
   }


> > How standard compliant do we want to be with this. If we want to be very
> > strict we should e.g. make sure the 'entry name' (first item on the ID
> > line) is not more then 10 characters.
> > 
> > P.S. (very) minor issue: the division we choose 'UNK' for sequences
> > which don't have a division set is not in the standard (speclist.txt),
> > it only contains UNKP
> > 
> 
> Sure, can (should) be changed.
> 
> > Should I try to adopt swiss.pm to the thoughts I (tried to) put out or
> > are there major objections ?
> >
> 
> See above. I'm not sure what we already have in the Bio::Seq::* hierarchy.
> If there's no Swiss.pm yet and GenBank/GenPept doesn't fit well, you could
> give Bio::Seq::Swiss.pm a start and adopt the parser to instantiate objects
> of that class.
> 
The only thing we have now is Bio::Seq::LargeSeq en LargePrimarySeq. Do
you plan on having a Bio::Seq::* class for every (complex) sequence type ?

Kris,